# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 


# Chapter 2: Choosing the right data types 

## Challenge 1: String or Categories? 

By the end of this challenge, you will have gained experience with when and where to use a category instead of a string, and vice versa. You will also come to understand the difference between ordinal (ordered) categories and non-ordinal (unordered) ones.

Use the following prompts to complete this challenge.

1.	Use the following code to create the pandas DataFrame `df_string`, and study the printout of `df_string.info()`. Note the Dtype of the column `Performance`, and also note how much space `df_string` takes up.

In [1]:
import pandas as pd
import numpy as np
df_string = pd.DataFrame(
    np.random.choice(
    ['Underperforming', 'Meets Expectations', 
     'Exceeds Expectations'],
    10**6),
    columns = ['Performance']
)
df_string.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   Performance  1000000 non-null  object
dtypes: object(1)
memory usage: 7.6+ MB


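**Answer**: The Dtype of the column `Performance` is listed as `object`.

The DataFrame `df_string` is reported to use 7.6+ MB, which is roughly 7,800 KB. The `+` means this figure is a lower bound: for an `object` column, `info()` counts only the 8-byte reference stored in each row (8 bytes for each of the one million rows is about 7.6 MB), not the Python string objects those references point to.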
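To see how much the `+` in `memory usage: 7.6+ MB` leaves out, one option is to ask pandas for a deep memory count. The following is a minimal sketch, not part of the original challenge: the name `df_small` and the smaller sample size are assumptions made to keep the check quick, and the column is cast to `object` explicitly so that it matches the printout above.

```python
import numpy as np
import pandas as pd

labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']

# Hypothetical smaller sample (10**5 rows); cast to object to mirror df_string.
df_small = pd.DataFrame(
    np.random.choice(labels, 10**5),
    columns=['Performance']
).astype(object)

# Shallow count: only the per-row references (what info() reports by default).
shallow = df_small.memory_usage(deep=False).sum()
# Deep count: also includes the Python string objects themselves.
deep = df_small.memory_usage(deep=True).sum()

print(f'shallow: {shallow / 1024:.1f} KB, deep: {deep / 1024:.1f} KB')
```

Calling `df_string.info(memory_usage='deep')` should report the deep figure directly, at the cost of a slower call.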

2.	Use the following code to create the pandas DataFrame `df_category`, and study the printout of `df_category.info()`. Note the Dtype of the column `Performance`, and also note how much space `df_category` takes up.

In [2]:
df_category = pd.DataFrame(
    np.random.choice(
    ['Underperforming', 'Meets Expectations', 
     'Exceeds Expectations'],
    10**6),
    columns = ['Performance'],
    dtype='category'
)
df_category.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column       Non-Null Count    Dtype   
---  ------       --------------    -----   
 0   Performance  1000000 non-null  category
dtypes: category(1)
memory usage: 976.8 KB


**Answer**: The Dtype of the column `Performance` is listed as `category`.

The DataFrame `df_category` uses 976.8 KB.

3.	What is the difference between the Dtype of `df_category` and `df_string`? Weren’t you expecting to see a string for `df_string`? What’s happening? Google to find an answer. 

**Answer**:
While `df_category`'s Dtype is `category`, `df_string`'s Dtype is listed as `object`.

One would expect `df_string` to be listed as string, but `object` is listed instead. Googling **pandas distinction between str and object types** shows that `object` is the catch-all name pandas uses for any data type that is not defined by pandas itself or by NumPy. In the output above, pandas stores each text value as an ordinary Python `str` object, which is why the column is reported as `object` even though every value in it is a string.
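The point about `object` columns holding ordinary Python strings can be checked directly. The sketch below is illustrative only: the names `s_object` and `s_string` are made up for this example, and the conversion to pandas' opt-in `string` Dtype is shown as one possible alternative rather than something the challenge requires.

```python
import pandas as pd

# A small object column, built explicitly so the Dtype matches the printout.
s_object = pd.Series(
    ['Underperforming', 'Meets Expectations'], dtype=object)

print(s_object.dtype)          # object
print(type(s_object.iloc[0]))  # <class 'str'>

# pandas also offers a dedicated, opt-in string Dtype.
s_string = s_object.astype('string')
print(s_string.dtype)
```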


4.	What is the difference between the memory sizes that `df_category` and `df_string` use? Which one uses significantly more, and how can you explain the big difference between the two DataFrames when they essentially contain the same data?

**Answer**:
There is a significant difference. `df_string` reports more than 7.6 MB (roughly 7,800 KB), while `df_category` reports 976.8 KB. In other words, `df_string` uses about 8 times (7800 / 976.8) as much memory as `df_category`, and the `+` in its printout means the true gap is larger still.

The reason is the way the data is encoded. As an `object` column, each of the one million rows holds an 8-byte reference to a Python string object, and each of those string objects takes additional memory that `info()` does not count. As a category, pandas stores the three distinct strings only once and then needs just one `int8` code (one byte) per row to point at one of the three categories. One million bytes is about 976.6 KB; the small remainder covers the categories and the index.
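The encoding of a categorical column can be inspected through the `.cat` accessor. The following sketch rebuilds a categorical column like the one in `df_category` (the name `s_cat` is an assumption for this example) and looks at its two parts: the categories, stored once, and the per-row integer codes.

```python
import numpy as np
import pandas as pd

labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']
s_cat = pd.Series(np.random.choice(labels, 10**6), dtype='category')

# The distinct labels are stored only once.
print(s_cat.cat.categories)

# Each row stores a small integer code pointing at one of the labels.
print(s_cat.cat.codes.dtype)          # int8
print(s_cat.cat.codes.nbytes / 1024)  # 976.5625
```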

5.	Run the following code, and note how long it takes your computer to perform a common analytic data manipulation (grouping by) for both `df_category` and `df_string`. Repeat the run a few times to make sure you are seeing a pattern and not noise.

In [3]:
import time
t0 = time.time()
df_string.groupby('Performance').size()
print(f'''groupby time for df_string: {
        time.time()-t0:.4f}s''')

t0 = time.time()
df_category.groupby('Performance').size()
print(f'''groupby time for df_category: {
        time.time()-t0:.4f}s''')

groupby time for df_string: 0.2508s
groupby time for df_category: 0.0166s


6.	Is there a significant difference between the run time for `df_category` and `df_string`? If yes, what do you think is the reason for the significant difference? *Hint: when there is more computational complexity, the reason is that the CPU has to work harder, and the question is to find the reason why the CPU has to work harder.*

**Answer**: Yes, there is a significant difference. In the run above, the same code is about 15 times faster (0.2508 / 0.0166) for `df_category`.

At its core, grouping the rows requires working out which group each value of `Performance` belongs to. For `df_string`, the CPU has to hash and compare a full Python string object for every one of the one million rows. For `df_category`, that work was already done when the column was created: each row is just a one-byte `int8` code, so the grouping operates on small integers.
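Since a single timing can be noisy, one way to repeat the measurement systematically is the standard library's `timeit` module. This is a sketch under assumptions: the helper `best_time`, the names `df_obj` and `df_cat`, and the smaller sample size are all hypothetical and chosen for this example, and the actual numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']
values = np.random.choice(labels, 10**5)

df_obj = pd.DataFrame({'Performance': values}).astype(object)
df_cat = df_obj.astype('category')

def best_time(df, repeats=5):
    """Return the fastest of several groupby runs, in seconds."""
    runs = timeit.repeat(
        lambda: df.groupby('Performance', observed=True).size(),
        number=1, repeat=repeats)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(runs)

t_obj = best_time(df_obj)
t_cat = best_time(df_cat)
print(f'object: {t_obj:.4f}s, category: {t_cat:.4f}s')
```

Taking the minimum over several runs is a common convention for micro-benchmarks; the average would also be a reasonable choice.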

7.	Study the following code. It attempts to filter `df_string`, using a boolean mask `BM`. Run the following code, and study its output. Does the code make logical sense? Does the output make logical sense?

In [4]:
BM = df_string.Performance >= 'Meets Expectations'
print(df_string[BM].value_counts())

Performance       
Underperforming       333747
Meets Expectations    332532
dtype: int64


**Answer**: The code may not make logical sense; it depends on what we wanted to get done. What does it mean for one string to be larger than another? Python compares strings lexicographically, character by character, based on their Unicode code points; for capitalized words like ours, this matches alphabetical order. For instance, `'B' > 'A'` evaluates to `True`, while `'B' < 'A'` evaluates to `False`. So if we expect the computer to understand automatically that we are looking for performances that are either 'Meets Expectations' or 'Exceeds Expectations', the code does not make logical sense.

However, the output makes logical sense: based on alphabetical order, 'Exceeds Expectations' comes before 'Meets Expectations', so we only get 'Underperforming' and 'Meets Expectations'.


In [5]:
'B' > 'A'

True

In [6]:
'B' < 'A'

False
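The same comparison rule explains the filter result in step 7. As a small illustration in plain Python (no pandas needed), sorting the three labels shows the order that `>=` relies on:

```python
labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']

# Strings sort by the code points of their characters, starting from the left.
print(sorted(labels))
# ['Exceeds Expectations', 'Meets Expectations', 'Underperforming']

# Only labels at or after 'Meets Expectations' in that order pass the filter.
passing = [label for label in labels if label >= 'Meets Expectations']
print(passing)
# ['Underperforming', 'Meets Expectations']

# The first letters already decide the order here.
print(ord('E'), ord('M'), ord('U'))  # 69 77 85
```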

8.	The following code tries to do the same filtering as done in the previous step, this time for `df_category`, but it returns an error. Run the code, study the error and figure out why the error happens. 

In [7]:
BM = df_category.Performance >= 'Meets Expectations'
print(df_category[BM].value_counts())

TypeError: Unordered Categoricals can only compare equality or not

**Answer**: The first line of the code, `BM = df_category.Performance >= 'Meets Expectations'`, generates this error because pandas does not accept ordering comparisons (`<`, `>`, `<=`, `>=`) on an unordered categorical Dtype; only equality checks (`==` and `!=`) are allowed.
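To illustrate what an unordered categorical does allow, the sketch below uses a tiny made-up Series `s` (an assumption for this example). Equality checks and `isin` work, so the intended rows can still be selected by listing the wanted labels explicitly, while `>=` raises the same `TypeError` as above.

```python
import pandas as pd

s = pd.Series(
    ['Underperforming', 'Meets Expectations',
     'Exceeds Expectations', 'Meets Expectations'],
    dtype='category')

# Equality comparisons are allowed on unordered categoricals.
is_meets = (s == 'Meets Expectations').tolist()
print(is_meets)  # [False, True, False, True]

# Listing the wanted labels avoids any need for an order.
wanted = ['Meets Expectations', 'Exceeds Expectations']
is_wanted = s.isin(wanted).tolist()
print(is_wanted)  # [False, True, True, True]

# Ordering comparisons are rejected.
try:
    s >= 'Meets Expectations'
    raised = False
except TypeError as err:
    raised = True
    print('TypeError:', err)
```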

9.	The error message from the previous step is *‘Unordered Categoricals can only compare equality or not’*. The following code creates an ordered categorical attribute in a pandas DataFrame. Study the code, and then run it.

In [8]:
from pandas.api.types import CategoricalDtype
categories = ['Underperforming', 'Meets Expectations',
              'Exceeds Expectations']

cat_type = CategoricalDtype(
    categories=categories, ordered=True)

df_o_category = pd.DataFrame(
    np.random.choice(categories,10**6),
    columns = ['Performance'],
    dtype= cat_type
)
df_o_category.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column       Non-Null Count    Dtype   
---  ------       --------------    -----   
 0   Performance  1000000 non-null  category
dtypes: category(1)
memory usage: 976.8 KB


10.	Compare the printout of `df_o_category.info()` from the preceding code, and that of `df_category.info()` from step 2. Is there any difference between the two?

**Answer**: No, the printouts are the same; therefore, `df_o_category` and `df_category` use the same amount of memory. `info()` reports both columns simply as `category` and does not show whether the categories are ordered.
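Although `info()` prints the same thing for both, the difference is visible through the `.cat` accessor. A minimal sketch, with the Series names `unordered` and `ordered` chosen only for this example:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

categories = ['Underperforming', 'Meets Expectations',
              'Exceeds Expectations']

unordered = pd.Series(categories, dtype='category')
ordered = pd.Series(
    categories,
    dtype=CategoricalDtype(categories=categories, ordered=True))

print(unordered.cat.ordered)  # False
print(ordered.cat.ordered)    # True

# Without an explicit list, pandas sorts the inferred categories alphabetically.
print(list(unordered.cat.categories))
# With an explicit list, the declared order is kept.
print(list(ordered.cat.categories))
```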

11.	The following code is the same as in step 8 but updated for `df_o_category`. Run the following code and study its output. Does the code make logical sense? Does the output make logical sense?

In [9]:
BM = df_o_category.Performance >= 'Meets Expectations'
print(df_o_category[BM].value_counts())

Performance         
Meets Expectations      333944
Exceeds Expectations    333889
Underperforming              0
dtype: int64


**Answer**: Yes, both the code and the output make logical sense.
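The output above still lists 'Underperforming' with a count of 0 because filtering rows does not remove a category from the Dtype. The sketch below, on a small made-up Series `s`, shows this along with a few other operations that the declared order makes possible; `remove_unused_categories` is one way to drop the empty category if it is not wanted.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

categories = ['Underperforming', 'Meets Expectations',
              'Exceeds Expectations']
cat_type = CategoricalDtype(categories=categories, ordered=True)

s = pd.Series(
    ['Meets Expectations', 'Underperforming', 'Exceeds Expectations'],
    dtype=cat_type)

# min, max, and sorting follow the declared order, not the alphabet.
print(s.min())  # Underperforming
print(s.max())  # Exceeds Expectations
print(s.sort_values().tolist())

# The filtered Series keeps all three categories, so one has a count of 0.
kept = s[s >= 'Meets Expectations']
print(kept.value_counts())

# Dropping unused categories removes the zero-count entry.
trimmed = kept.cat.remove_unused_categories()
print(trimmed.value_counts())
```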

12.	From your experiences in steps 2 and 9, and also steps 8 and 11, explain in your own words the difference between pandas' unordered categorical Dtype and ordered categorical Dtype. What are the advantages and disadvantages of each?

**Answer**: They are essentially the same, and also take up the same amount of memory space, but there are two important differences between them. First, the operators < and > are not accepted for unordered categorical data. Second, to create an ordered categorical datatype we have to be more intentional by defining the categories and the order between them.

*Advantages of unordered categorical Dtype*:
- Easier to code

*Disadvantages of unordered categorical Dtype*:
- Not possible to use the < or > operators

*Advantages of ordered categorical Dtype*:
- Possible to use the < or > operators

*Disadvantages of ordered categorical Dtype*:
- Have to specify the categories and their order
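The extra effort for an ordered categorical is mostly a matter of declaring the order once. As a sketch, assuming a column that already exists as an unordered categorical (the names `s_unordered`, `s_ordered`, and `ordered_type` are made up for this example), it can be converted with `astype`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

categories = ['Underperforming', 'Meets Expectations',
              'Exceeds Expectations']

s_unordered = pd.Series(
    ['Meets Expectations', 'Underperforming', 'Exceeds Expectations'],
    dtype='category')

# Declare the order once, then convert the existing column.
ordered_type = CategoricalDtype(categories=categories, ordered=True)
s_ordered = s_unordered.astype(ordered_type)

print(s_ordered.cat.ordered)  # True
mask = (s_ordered >= 'Meets Expectations').tolist()
print(mask)  # [True, False, True]
```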

13.	The three data types that we worked with in this challenge are string, pandas unordered categorical, and pandas ordered categorical. From your experience with this challenge, provide at least two situations for each in which that data type is the best choice.

**Answers**:

*Situations for string*:
- The majority of the text values are unique.


*Situations for unordered categorical Dtype*:
- The majority of the text values are repeated, and there is no meaningful order between the possibilities.
- The majority of the text values are repeated, and we will not need to perform analytics that depend on a logical order between the possibilities.

*Situations for ordered categorical Dtype*:
- The majority of the text values are repeated, and there is a meaningful order between the possibilities.
- The majority of the text values are repeated, and we will need to perform analytics that depend on the logical order between the possibilities.
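The "majority of values are repeated" criterion used in the answers above can be turned into a rough check. The helper below is hypothetical: the name `suggest_text_dtype` and the 0.5 threshold are assumptions made for illustration, not a rule from pandas, and whether the category should be ordered still has to be decided by someone who knows the data.

```python
import pandas as pd

def suggest_text_dtype(series, max_unique_ratio=0.5):
    """Suggest 'category' when most values repeat, else 'string'.

    The 0.5 threshold is an arbitrary assumption for this sketch.
    """
    unique_ratio = series.nunique() / len(series)
    return 'category' if unique_ratio < max_unique_ratio else 'string'

# Two labels repeated 50 times each: heavily repeated values.
repeated = pd.Series(['Underperforming', 'Meets Expectations'] * 50)
# One hundred distinct free-text values: nothing repeats.
unique = pd.Series([f'comment {i}' for i in range(100)])

print(suggest_text_dtype(repeated))  # category
print(suggest_text_dtype(unique))    # string
```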
