# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 
    
    AUTHOR: Dr. Roy Jafari 


# Chapter 2: Choosing the right data types 


## Challenge 4: string or datetime?

After going through this challenge, you will be able to choose intelligently between string and DateTime when both are possible, and to explain the reasons for your choice from experience.  
Let us get started. Answer the following questions or complete the following steps.
1.	Using pandas, read the file *orders.csv* and study the resulting DataFrame. The following code gets this done.

```
import pandas as pd
order_df = pd.read_csv('orders.csv')
print(order_df)
```

2.	Run the following code to figure out the data type of the column `date` in `order_df`. What is its data type?

```
order_df.info()
```

**Answer**: 


3.	The following code first renames the column `date` to `date_str`. After that, the code uses `pd.to_datetime()` to create the new column `date_dt`, which contains the same information but with the data type *DateTime*. Run the following code and study its output; specifically, note the *Dtype* of `date_str` and `date_dt`. 

```
order_df  = order_df.rename(columns = {'date': 'date_str'})
order_df['date_dt'] = pd.to_datetime(order_df.date_str)
order_df.info()
```
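
When the layout of the date strings is known, it can be passed to `pd.to_datetime()` explicitly. The sketch below uses a small made-up Series (`sample`, not taken from *orders.csv*) and assumes the `YYYY-MM-DD` layout that the later steps rely on.

```python
import pandas as pd

# Hypothetical sample values; the real data lives in orders.csv.
sample = pd.Series(['2022-10-31', '2022-11-01', '2022-11-05'], name='date_str')

# An explicit format documents the expected layout and rejects other layouts.
sample_dt = pd.to_datetime(sample, format='%Y-%m-%d')

print(sample.dtype)     # dtype of the string column
print(sample_dt.dtype)  # dtype after conversion
```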

**Answer**: 


4.	The following code prints the number of bytes of memory that each column of `order_df` uses, including `date_str` and `date_dt`. Passing `deep=True` asks pandas to also count the memory held by the string values themselves. Is there any difference in the amount of memory that the date information takes up when encoded as a string versus as DateTime?

```
print(order_df.memory_usage(deep=True))
```
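
By default, `memory_usage()` reports a shallow measurement; `deep=True` additionally inspects the Python objects that a column stored with the `object` dtype points to. A minimal sketch on synthetic data (`toy_df` is made up for illustration) that puts the two measurements side by side:

```python
import pandas as pd

# Synthetic data, only to illustrate the two ways of measuring memory.
toy_df = pd.DataFrame({'date_str': ['2022-10-31', '2022-11-01', '2022-11-05'] * 1000})
toy_df['date_dt'] = pd.to_datetime(toy_df['date_str'])

shallow = toy_df.memory_usage()        # default, shallow measurement
deep = toy_df.memory_usage(deep=True)  # also inspects the stored string values
print(shallow)
print(deep)
```

Comparing the two printed Series column by column shows which columns are affected by the choice of measurement.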

**Answer**: 

5.	The following code draws a line plot of the column `quantity`. Run the code and investigate the ensuing plot. There is a logical error in the line plot. Investigate to see if you can figure it out.

```
import matplotlib.pyplot as plt
(order_df
 .set_index('date_dt')
 .quantity
 .plot())

(order_df
 .set_index('date_dt')
 .rolling(window=10)
 .quantity.mean()
 .plot())

plt.xticks(rotation=90)
plt.show()
```

**Answer**: 

6.	Running the following code will give you a hint for answering the previous question. Carefully read the dates and the order in which they appear. Do you see anything out of the ordinary? What do you think is causing the problem?

```
print(order_df[:20])
print(order_df[-20:])
```

**Answer**: 


7.	The problem is that on some dates the value of quantity must have been zero, and because of that those dates are not included in `order_df`. To fix our analysis, we need to add those rows to `order_df`. In this step, we will fix the problem using `date_str`. In the next step, we will do the same thing using `date_dt`. In the step after that, we will compare the two approaches.

This is going to be a long step with multiple sub-steps.

First, we get a copy of `order_df` and only keep `date_str` and `quantity` as columns. The name of the new DataFrame will be `order_str_df`. Run the following code to get this done.

```
order_str_df = order_df.reset_index()[['date_str','quantity']].copy()
print(order_str_df)
```

Next, we will create the function `get_next_date()` that, given any date, outputs the next date. We need such a function because the calendar is not uniform: the number of days differs from month to month. The following code captures this in the dictionary `end_months`, which assumes a non-leap year (February has 28 days), and `get_next_date()` relies on `end_months` to work. Run the following code to define this function. 


```
end_months = {'01':31, '02':28, '03':31,
              '04':30, '05':31, '06':30,
              '07':31, '08':31, '09':30,
              '10':31, '11':30, '12':31}

def get_next_date(date):
    
    (year,month,day) = date.split('-')
    max_day = end_months[month]
    
    if max_day > int(day):
        new_date = f'0{int(day)+1}'[-2:]
        return f'{year}-{month}-{new_date}'
    else:
        if int(month) < 12:
            new_month = f'0{int(month)+1}'[-2:]
            return f'{year}-{new_month}-01'
        else:
            return f'{int(year)+1}-01-01'
```
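
Hand-written calendar logic is easy to get subtly wrong, so it helps to cross-check it against an independent source. A minimal sketch using the standard library `datetime` module; `next_date_stdlib` is a hypothetical helper introduced here for comparison, not part of the challenge code, and it assumes dates written as `YYYY-MM-DD`:

```python
from datetime import date, timedelta

def next_date_stdlib(date_str):
    """Hypothetical helper: next calendar day for a 'YYYY-MM-DD' string."""
    # fromisoformat parses 'YYYY-MM-DD'; timedelta handles month lengths and leap years.
    return (date.fromisoformat(date_str) + timedelta(days=1)).isoformat()

print(next_date_stdlib('2022-12-31'))  # year rollover
print(next_date_stdlib('2024-02-28'))  # 2024 is a leap year
```

Running `get_next_date()` on the same inputs and comparing the results shows where the two agree and where they differ.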

After creating the function, give it a try by running it for a few dates, for instance, run `get_next_date('2022-12-31')`.

```
get_next_date('2022-12-31')
```

Next, we define another function, `dates_between()`, which takes in two dates and outputs a list of the dates strictly between them. The function `dates_between()` relies on `get_next_date()` to work. Run the following code to define `dates_between()`.


```
def dates_between(date1,date2):
    if date1 == date2:
        # Return an empty list (not None) so callers can safely .extend() with the result.
        return []
    else:
        output = []
        next_date = date1
        while get_next_date(next_date) != date2:
            next_date = get_next_date(next_date)
            output.append(next_date)
        return output
```

After creating the function, give it a try by running it for a few pairs of dates; for instance, run `dates_between('2022-10-31','2022-11-05')`.


```
dates_between('2022-10-31','2022-11-05')
```

Now that we are armed with the `dates_between()` function, we can run the following code to identify the dates that are missing in `order_str_df`.

```
missing_dates = []
for i,row in order_str_df.iterrows():
    if i== 0:
        continue
    p_date = order_str_df.loc[i-1,'date_str']
    the_date = row.date_str
    missing_dates.extend(dates_between(p_date,the_date))
print(missing_dates)
```

Next, we will use the list `missing_dates` to add the missing dates to `order_str_df`. The following code gets this done.


```
my_index = pd.Series(missing_dates,name='date_str')
to_be_added_df = pd.DataFrame(
    0.0,
    index=my_index,
    columns = ['quantity'])
order_str_df = pd.concat(
        [order_str_df,
         to_be_added_df.reset_index()]
)
order_str_df = (order_str_df
    .sort_values('date_str')
    .reset_index(drop=True)
)
print(order_str_df)
```

After running the preceding code, `order_str_df` will not have the missing-dates issue of `order_df`. You can check this by running the following code and studying its output.

```
print(order_str_df[:20])
print(order_str_df[-20:])
```

Lastly, we can use `order_str_df` to create the correct version of the plot that we drew under step 5. Run the following code and compare the ensuing plots with the plot from step 5.

```
(order_str_df
 .set_index('date_str')
 .quantity.plot()
)
(order_str_df
 .set_index('date_str')
 .rolling(window=10)
 .quantity.mean()
 .plot()
)
plt.xticks(rotation=90)
plt.show()
```

After running all of the code, write out the tasks that we had to perform to remedy the issue we found in step 5.

**Answer**: 


8.	In this step, we will redo what we did in step 7 and remedy the issue we found in step 5, but this time using `date_dt`. In other words, instead of manipulating dates as strings, we will use the data type DateTime.

This is going to be another long step with multiple sub-steps.

First, we get a copy of `order_df` and only keep `date_dt` and `quantity` as columns. The name of the new DataFrame will be `order_dt_df`. Run the following code to get this done.

```
order_dt_df = order_df.reset_index()[['date_dt','quantity']].copy()
print(order_dt_df)
```

Next, we will find the first and last date we see on `order_dt_df`. The following code gets this done.

```
first_date = order_dt_df.date_dt.min()
last_date = order_dt_df.date_dt.max()
print(first_date,last_date)
```

Next, we will find the number of days between the `first_date` and `last_date`. The following code gets this done.

```
n_days = (last_date - first_date).days
print(n_days)
```
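
Subtracting one timestamp from another yields a duration object whose `.days` attribute holds the whole number of days. A small sketch with made-up endpoints (`start` and `end` stand in for `first_date` and `last_date`):

```python
import pandas as pd

# Made-up endpoints, standing in for first_date and last_date.
start = pd.Timestamp('2022-10-31')
end = pd.Timestamp('2022-11-05')

gap = end - start  # subtracting two Timestamps yields a Timedelta
print(type(gap).__name__, gap.days)
```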

Next, we will create a list of all the dates from `first_date` to `last_date`, inclusive. The following code takes advantage of `datetime.timedelta` to easily get this done.

```
import datetime
all_dates = [
    first_date + datetime.timedelta(days=i) for i in range(n_days+1)
]
print(all_dates)
```
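
As an alternative to the list comprehension, pandas can generate the same kind of daily sequence directly with `pd.date_range()`. A sketch under the assumption that both endpoints should be included; `start` and `end` are made-up values standing in for `first_date` and `last_date`:

```python
import pandas as pd

start = pd.Timestamp('2022-10-31')
end = pd.Timestamp('2022-11-05')

# freq='D' steps one calendar day at a time; both endpoints are included.
all_days = pd.date_range(start=start, end=end, freq='D')
print(list(all_days))
```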

Next, we will create `stage_df`, a DataFrame with one row for each date in `all_dates` and a `quantity` column filled with zeros.

```
stage_df = pd.DataFrame(0.0,
                        index = all_dates,
                        columns =['quantity'])
print(stage_df)
```

Next, we will use `.update()`, a method of every pandas DataFrame, to update `stage_df` using `order_dt_df`. Once `stage_df` is updated, we no longer need the old `order_dt_df`, so we assign `stage_df` to the name `order_dt_df`. 


```
stage_df.update(order_dt_df.set_index('date_dt'))
order_dt_df = stage_df
print(order_dt_df)
```
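
The staging-and-update pattern can also be expressed with `.reindex()`, which aligns a DataFrame to a new index and fills the rows it has to create. The following is a sketch of that alternative on toy data, not a replacement for the code above; `toy_df`, `full_index`, and `filled_df` are names made up for this illustration.

```python
import pandas as pd

# Toy stand-in for order_dt_df with two missing days (2 and 4 November).
toy_df = pd.DataFrame({
    'date_dt': pd.to_datetime(['2022-11-01', '2022-11-03', '2022-11-05']),
    'quantity': [4.0, 7.0, 2.0],
})

full_index = pd.date_range(toy_df.date_dt.min(), toy_df.date_dt.max(), freq='D')

# reindex adds a row for every date in full_index; fill_value sets the new rows to zero.
filled_df = toy_df.set_index('date_dt').reindex(full_index, fill_value=0.0)
print(filled_df)
```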

After running the preceding code, `order_dt_df` will not have the missing-dates issue of `order_df`. You can check this by running the following code and studying its output.

```
print(order_dt_df[:20])
print(order_dt_df[-20:])
```

Lastly, we can use `order_dt_df` to create the correct version of the plot that we drew under step 5. Run the following code and compare the ensuing plot with the plot from step 5.

```
order_dt_df.quantity.plot()
(order_dt_df
 .rolling(window=10)
 .quantity.mean()
 .plot()
)
plt.xticks(rotation=90)
plt.show()
```

After running all of the code, write out the tasks that we had to perform to remedy the issue we found in step 5.

**Answer**: 


9.	The plots that we created under steps 7 and 8 are both correct, but there is a difference between them. Compare them to find the difference. Which one is better? What’s causing the serendipitous improvement? 

**Answer**: 


10.	Compare what we did in steps 7 and 8. In both steps, we were trying to remedy the missing-dates issue in `order_df`. In step 7, the date information was encoded as the data type string, and in step 8, it was encoded as DateTime. In which of the two steps did we have to create new functions? Which of the two steps would have been easier to develop in a real project?

**Answer**:
- **In which of the two steps did we have to create new functions?**: 

- **Which of the two steps would have been easier to develop in a real project?**: 


11.	Compare string and DateTime as options for encoding and manipulating date information, using the following criteria: 1) RAM space, 2) CPU performance, 3) run time, 4) coding time.  
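
For the run-time criterion, one option is to measure rather than guess. The sketch below uses the standard library `timeit` to time a single next-day step in both encodings. `next_day_str` and `next_day_dt` are hypothetical helpers written for this illustration (the string version is a simplified stand-in for `get_next_date()` and ignores leap years), and the timings depend on the machine, so treat the output as one data point.

```python
import timeit
from datetime import date, timedelta

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year

def next_day_str(date_str):
    """String-based step, simplified from step 7 (ignores leap years)."""
    year, month, day = (int(part) for part in date_str.split('-'))
    if day < DAYS_IN_MONTH[month - 1]:
        day += 1
    elif month < 12:
        month, day = month + 1, 1
    else:
        year, month, day = year + 1, 1, 1
    return f'{year:04d}-{month:02d}-{day:02d}'

def next_day_dt(d):
    """Date-based step using the standard library."""
    return d + timedelta(days=1)

d0 = date(2022, 12, 31)  # built once so only the step itself is timed
str_seconds = timeit.timeit(lambda: next_day_str('2022-12-31'), number=100_000)
dt_seconds = timeit.timeit(lambda: next_day_dt(d0), number=100_000)
print(f'string: {str_seconds:.3f}s, date: {dt_seconds:.3f}s')
```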

**Answer**: 

