# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 


# Chapter 2: Choosing the right data types 

## Challenge 1: String or Categories? 

By the end of this challenge, you will have gained experience in deciding when to use a category instead of a string, and vice versa. You will also come to understand the difference between ordinal (ordered) categories and non-ordinal (unordered) ones.

Use the following prompts to complete this challenge.

1.	Use the following code to create the pandas DataFrame `df_string`, and study the printout of `df_string.info()`. Note the Dtype of the column `Performance`, and also note how much memory `df_string` takes up.

```
import pandas as pd
import numpy as np
df_string = pd.DataFrame(
    np.random.choice(
    ['Underperforming', 'Meets Expectations', 
     'Exceeds Expectations'],
    10**6),
    columns = ['Performance']
)
df_string.info()
```

**Answer**:

2.	Use the following code to create the pandas DataFrame `df_category`, and study the printout of `df_category.info()`. Note the Dtype of the column `Performance`, and also note how much memory `df_category` takes up.

```
df_category = pd.DataFrame(
    np.random.choice(
    ['Underperforming', 'Meets Expectations', 
     'Exceeds Expectations'],
    10**6),
    columns = ['Performance'],
    dtype='category'
)
df_category.info()
```

**Answer**: 


3.	What is the difference between the Dtype of `df_category` and that of `df_string`? Weren’t you expecting to see a string for `df_string`? What’s happening? Search online to find an answer.
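One way to start investigating is to compare the dtype pandas reports for a column with the Python type of a single stored element. The sketch below uses a small hypothetical Series named `sample` as a stand-in for `df_string.Performance`; what the dtype prints as may depend on your pandas version.

```python
import pandas as pd

# Hypothetical three-row stand-in for df_string.Performance.
sample = pd.Series(['Underperforming', 'Meets Expectations',
                    'Exceeds Expectations'])

# The dtype pandas reports for the whole column.
print(sample.dtype)

# The Python type of one stored element.
element_type = type(sample.iloc[0])
print(element_type)
```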

**Answer**:

4.	What is the difference between the amounts of memory that `df_category` and `df_string` use? Which one uses significantly more, and how can you explain the big difference between the two DataFrames when they contain essentially the same data?
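When a column holds Python objects, `info()` by default reports a lower bound on memory (marked with a `+`). As a possible aid for this step, the sketch below measures memory with `memory_usage(deep=True)`, which also counts the strings themselves. The names `small_string` and `small_category` and the reduced row count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']

# Smaller, hypothetical versions of df_string and df_category.
values = np.random.choice(labels, 10**5)
small_string = pd.DataFrame({'Performance': values})
small_category = small_string.astype('category')

# deep=True inspects the stored objects instead of only the pointers.
string_bytes = small_string.memory_usage(deep=True).sum()
category_bytes = small_category.memory_usage(deep=True).sum()

print(f'string column:   {string_bytes / 1e6:.2f} MB')
print(f'category column: {category_bytes / 1e6:.2f} MB')
```

The same deep measurement is available from `info()` itself through `df_string.info(memory_usage='deep')`.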

**Answer**:


5.	Run the following code, and note how long it takes your computer to perform a common analytic data manipulation (a group-by) on both `df_category` and `df_string`. Repeat the run a few times to make sure you are seeing a pattern and not noise.

```
import time

# Time the group-by on the string column.
t0 = time.perf_counter()
df_string.groupby('Performance').size()
print(f'groupby time for df_string: {time.perf_counter() - t0:.4f}s')

# Time the same group-by on the categorical column.
t0 = time.perf_counter()
df_category.groupby('Performance').size()
print(f'groupby time for df_category: {time.perf_counter() - t0:.4f}s')
```
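Since the prompt asks you to repeat the measurement, one option is to let the standard-library `timeit` module do the repetition. The following is a minimal sketch, not the challenge's own method: the helper `time_groupby`, the smaller DataFrames, and the explicit `observed=True` argument (passed so both calls behave the same across pandas versions) are all assumptions.

```python
import timeit

import numpy as np
import pandas as pd

labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']

# Smaller, hypothetical versions of df_string and df_category.
values = np.random.choice(labels, 10**5)
small_string = pd.DataFrame({'Performance': values})
small_category = small_string.astype('category')


def time_groupby(df, repeats=5):
    """Return the fastest time (in seconds) over `repeats` group-by runs."""
    timings = timeit.repeat(
        lambda: df.groupby('Performance', observed=True).size(),
        repeat=repeats,
        number=1,
    )
    return min(timings)


print(f'best groupby time, string:   {time_groupby(small_string):.4f}s')
print(f'best groupby time, category: {time_groupby(small_category):.4f}s')
```

Taking the minimum over several runs is a common way to reduce the influence of other processes competing for the CPU.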

**Answer**:

6.	Is there a significant difference between the run times for `df_category` and `df_string`? If yes, what do you think is the reason for the difference? *Hint: higher computational cost means the CPU has to work harder, so the question is why the CPU has to work harder in one case than in the other.*
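To reason about what the CPU actually works on, it can help to look at how a categorical column is stored. The sketch below uses a small hypothetical Series named `sample` and prints its category labels and its per-row integer codes through the `.cat` accessor.

```python
import pandas as pd

# Hypothetical four-row categorical column.
sample = pd.Series(
    ['Underperforming', 'Meets Expectations',
     'Exceeds Expectations', 'Meets Expectations'],
    dtype='category',
)

# The distinct labels are stored once.
print(sample.cat.categories)

# Each row stores only a small integer that points at one of the labels.
print(sample.cat.codes)
```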

**Answer**: 

7.	Study the following code. It attempts to filter `df_string`, using a boolean mask `BM`. Run the following code, and study its output. Does the code make logical sense? Does the output make logical sense?

```
BM = df_string.Performance >= 'Meets Expectations'
print(df_string[BM].value_counts())
```
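When judging the output, it may help to see how plain Python compares the three labels, since `>=` on strings compares them character by character. A minimal sketch, with the dictionary name `comparison` chosen only for illustration:

```python
labels = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']

# Compare each label against the threshold used in the boolean mask.
comparison = {label: label >= 'Meets Expectations' for label in labels}

for label, result in comparison.items():
    print(f'{label!r} >= "Meets Expectations": {result}')
```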

**Answer**:  

8.	The following code tries to do the same filtering as done in the previous step, this time for `df_category`, but it returns an error. Run the code, study the error and figure out why the error happens. 

```
BM = df_category.Performance >= 'Meets Expectations'
print(df_category[BM].value_counts())
```
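If you would like to capture the error message without stopping the rest of your notebook, one option is to wrap the comparison in `try`/`except`. This sketch assumes a small hypothetical Series named `sample` in place of `df_category.Performance`.

```python
import pandas as pd

# Hypothetical unordered categorical column.
sample = pd.Series(
    ['Underperforming', 'Meets Expectations', 'Exceeds Expectations'],
    dtype='category',
)

error_message = None
try:
    # Ordering comparisons are not defined for unordered categoricals.
    mask = sample >= 'Meets Expectations'
except TypeError as err:
    error_message = str(err)
    print(f'TypeError: {error_message}')
```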

**Answer**: 


9.	The error message from the previous step is *‘Unordered Categoricals can only compare equality or not’*. The following code creates an ordered categorical attribute in a pandas DataFrame. Study the code, and then run it.

```
from pandas.api.types import CategoricalDtype
categories = ['Underperforming', 'Meets Expectations',
              'Exceeds Expectations']

cat_type = CategoricalDtype(
    categories=categories, ordered=True)

df_o_category = pd.DataFrame(
    np.random.choice(categories,10**6),
    columns = ['Performance'],
    dtype= cat_type
)
df_o_category.info()
```
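The same `CategoricalDtype` can also be applied to a column that already exists, using `astype`. The following sketch assumes a small hypothetical Series named `existing`; sorting the converted column shows that it follows the declared category order and not the alphabetical one.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

categories = ['Underperforming', 'Meets Expectations',
              'Exceeds Expectations']
cat_type = CategoricalDtype(categories=categories, ordered=True)

# Hypothetical existing column of plain strings.
existing = pd.Series(['Meets Expectations', 'Underperforming',
                      'Exceeds Expectations'])

# Convert in one step; the order comes from `categories`.
converted = existing.astype(cat_type)

print(converted.cat.ordered)
print(converted.sort_values().tolist())
```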

**Answer**:

10.	Compare the printout of `df_o_category.info()` from the preceding code with that of `df_category.info()` from step 2. Is there any difference between the two?
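Beyond `info()`, the dtype object itself carries more detail. As a possible aid, the sketch below builds two small hypothetical Series, `unordered` and `ordered`, and inspects the dtype name, the `ordered` flag, and the full dtype representation of each.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

values = ['Underperforming', 'Meets Expectations', 'Exceeds Expectations']

# Hypothetical unordered and ordered versions of the same data.
unordered = pd.Series(values, dtype='category')
ordered = pd.Series(values, dtype=CategoricalDtype(values, ordered=True))

# The short dtype name, as a summary table would show it.
print(unordered.dtype.name, ordered.dtype.name)

# The flag that records whether the categories have an order.
print(unordered.cat.ordered, ordered.cat.ordered)

# The full representation lists the categories and the flag.
print(repr(unordered.dtype))
print(repr(ordered.dtype))
```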

**Answer**: 

11.	The following code is the same as in step 8, but updated for `df_o_category`. Run the code and study its output. Does the code make logical sense? Does the output make logical sense?

```
BM = df_o_category.Performance >= 'Meets Expectations'
print(df_o_category[BM].value_counts())
```

**Answer**: 

12.	From your experiences in steps 2 and 9, and also steps 8 and 11, explain in your own words the difference between pandas' unordered categorical Dtype and ordered categorical Dtype. What are the advantages and disadvantages of each?

**Answer**:

13.	The three data types that we worked with in this challenge are string, pandas unordered categorical, and pandas ordered categorical. Drawing on your experience with this challenge, describe at least two situations in which each data type is the best choice.

**Answers**:

