Homework 5 : Dates and Times

In this assignment, we will study tables. We will get practice with panel data consisting of records with time-stamps. By learning to process dates and times, we can remove inconsistencies to determine interesting patterns in the data.   

The questions guide you step-by-step through these approaches. Please post in the homework 5 channel of Slack with any questions. 


### Rubric

Question | Points
--- | ---
Question 1.1 | 1
Question 1.2 | 1
Question 1.3 | 1
Question 2.1 | 1
Question 2.2 | 1
Question 3.1 | 1
Question 4.1 | 1
Question 4.2 | 1
Question 4.3 | 1
Total | 9

We will study data from the stock market. The dataset has three columns.

<img src="image.PNG"  width="350"/>

The columns consist of

- stock price 
- volume of trading 
- date and time of transaction

However, the records have inconsistencies. Before we can look for trends in the price and volume, we must process the data, particularly the time-stamps.  

### 0. Load Packages

We have been working with the

-  `numpy` package for manipulations of arrays
-  `matplotlib` package for generating charts
- `pandas` package for handling tables 

Here we will learn about the `datetime` and `pytz` packages to get experience with operations on dates and times.

In [None]:
# import some packages

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from datetime import datetime as dt
import pytz

# change some settings

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 8)

plt.rcParams['figure.figsize'] = (10,8)

In [None]:
# TEST

import sys

assert "numpy" in sys.modules and "np" in locals()
assert "pandas" in sys.modules and "pd" in locals()
assert "matplotlib" in sys.modules and "plt" in locals()
assert "datetime" in sys.modules and "dt" in locals()

Note that we changed some of the default settings in `pandas` with `set_option` and in `matplotlib` with `rcParams`. 

### 1. Date and Time Data Types

The `datetime` package allows us to store an object containing year, month, and day.  

In [None]:
date = dt(2011, 1, 7)

print(date.year, date.month, date.day)

Note that `datetime` objects have their own data type.

In [None]:
type(date)

The `datetime` package has support for time in terms of hours, minutes, seconds, and microseconds.

In [None]:
date = dt(2011, 1, 7, 10, 25, 55, 590172)
print(date.hour, date.minute, date.second, date.microsecond)

#### Question 1.1

We can perform operations on `datetime` objects. If we compare two dates, then we can check the order of occurrence. 

In [None]:
dt(2011, 1, 7) < dt(2012, 5, 15) 

If we subtract one `datetime` object from another, then we get the elapsed time between them.

In [None]:
datetime_range = dt(2011, 1, 7, 10, 25, 55, 590172) - dt(2008, 2, 10, 16, 25, 35, 690192)

print(datetime_range)

Note that the difference of `datetime` objects has its own data type.

In [None]:
type(datetime_range)
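
The difference of two `datetime` objects stores its length as whole days plus leftover seconds and microseconds. The sketch below uses hypothetical dates (not taken from the dataset) to show how the pieces can be read off, assuming only the standard `datetime` package.

```python
from datetime import datetime as dt

# Hypothetical dates chosen only for illustration.
start = dt(2020, 1, 1, 8, 0, 0)
end = dt(2020, 1, 3, 20, 30, 0)

elapsed = end - start                           # a datetime.timedelta object

whole_days = elapsed.days                       # complete days in the difference
leftover_seconds = elapsed.seconds              # seconds left over after the whole days
total_hours = elapsed.total_seconds() / 3600    # the full duration expressed in hours

print(whole_days, leftover_seconds, total_hours)
```

Note that `days` drops the leftover part of the duration, while `total_seconds()` keeps all of it.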

Use subtraction to determine the number of days between January 2, 2006 and July 7, 2012. Here we take the time to be `00:00:00:000000`. 

In [None]:
q1_1 = dt(2012, 7, 7,0,0,0,0) - dt(2006, 1, 2,0,0,0,0)
q1_1 = q1_1.days 
# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert type(q1_1) == int
assert 2000 < q1_1 < 3000


#### Question 1.2

We can convert `datetime` objects to strings.  

In [None]:
date = dt(2011, 7, 30)

print(dt.strftime(date, '%Y-%m-%d'))

We can convert strings to `datetime` objects. Note that we need to specify the format of the dates.

In [None]:
datetime_string = '2011-07-30'

date = dt.strptime(datetime_string, '%Y-%m-%d')

type(date)

Please check this reference for a [list of the format codes](https://www.w3schools.com/python/python_datetime.asp).
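
Format codes can also describe the time of day. As a minimal sketch with a made-up time-stamp, the codes `%H`, `%M`, and `%S` stand for hours (24-hour clock), minutes, and seconds, and the same codes work in both directions.

```python
from datetime import datetime as dt

# Hypothetical time-stamp written as day-month-year followed by a 24-hour time.
stamp = '07-01-2011 10:25:55'

# %d = day, %m = month, %Y = four-digit year, %H:%M:%S = hours, minutes, seconds
parsed = dt.strptime(stamp, '%d-%m-%Y %H:%M:%S')

# Write the same moment back out in a different layout.
reformatted = parsed.strftime('%Y/%m/%d %H:%M')

print(parsed)
print(reformatted)
```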

Convert the string `2011/01/20` into a `datetime` object for January 20, 2011.

In [None]:
q1_2 = dt.strptime('2011/01/20', '%Y/%m/%d')

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert q1_2.year == 2011
assert q1_2.month == 1
assert q1_2.day == 20


#### Question 1.3 

Often we want to use `datetime` objects in the index of a `pandas` series or dataframe.

In [None]:
dates = [dt(2011, 1, 2), dt(2011, 1, 5), dt(2011, 1, 7)]

index_dates = pd.DatetimeIndex(dates)
index_dates
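
An index built this way is date-aware. The sketch below, which rebuilds the same three dates under the illustrative name `idx`, reads off the year of each entry and computes the gaps between consecutive dates.

```python
import pandas as pd
from datetime import datetime as dt

# Same three dates as above, stored under an illustrative name.
idx = pd.DatetimeIndex([dt(2011, 1, 2), dt(2011, 1, 5), dt(2011, 1, 7)])

years = idx.year              # year of every entry
gaps = idx[1:] - idx[:-1]     # element-wise differences between consecutive dates

print(years.tolist())
print(gaps.days.tolist())
```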

We have a `DatetimeIndex`, meaning a `pandas` container that supports operations on dates and times. We can use the `pandas` function `to_datetime` to convert a list of strings to a `DatetimeIndex`.

In [None]:
dates = ['7/6/2011', '8/6/2011']

pd.to_datetime(dates, format = "%m/%d/%Y")
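
The `format` argument matters because a string like `7/6/2011` is ambiguous. A small sketch of the two readings, using the same strings as above:

```python
import pandas as pd

ambiguous = ['7/6/2011', '8/6/2011']

# Month first: July 6 and August 6.
month_first = pd.to_datetime(ambiguous, format='%m/%d/%Y')

# Day first: June 7 and June 8.
day_first = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print(month_first.month.tolist(), day_first.month.tolist())
```

Both calls succeed, so a wrong format can go unnoticed unless we inspect the result.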

Convert the list of strings 

```python
['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08']
```

to a `DatetimeIndex` object with `to_datetime`. 

In [None]:
q1_3 = ['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08']
q1_3 = pd.to_datetime(q1_3, format = "%Y-%m-%d")
#type(q1_3)
# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert len(q1_3) == 5


### 2. Date Ranges

Often we want to generate many dates according to a pattern. 

#### Question 2.1 

If we have a starting date and ending date, then we can fill in intermediate dates based on a frequency. 

In [None]:
pd.date_range(start = '2012-04-01', end = '2012-04-03', freq="D")

If we have a starting date, then we can add a certain number of periods based on a frequency. 

In [None]:
pd.date_range(start='4/1/2012', periods=3, freq="M")
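
The monthly frequency refers to the end of each month, so the generated dates fall on the last calendar day even though the start date is the first of April. The sketch below makes this explicit by passing the offset object `pd.offsets.MonthEnd()` instead of the string alias. Using the offset object is a choice made here for illustration, since some newer `pandas` releases may warn about the alias `M` and suggest `ME`.

```python
import pandas as pd

# Three month-end dates, counting forward from April 1, 2012.
month_ends = pd.date_range(start='2012-04-01', periods=3, freq=pd.offsets.MonthEnd())

print(month_ends)
print(month_ends.day.tolist())
```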

Generate a `DatetimeIndex` containing 10 dates between `2010-01-01` and `2020-01-01` with frequency `Y`.

In [None]:
q2_1 = pd.date_range(start = '2010-01-01', end = '2020-01-01', freq="Y")

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert len(q2_1) == 10


#### Question 2.2 

Commonly we use a `DatetimeIndex` object for the index of a `pandas` Series or DataFrame.

In [None]:
panel_data = pd.Series(data = range(5), index = pd.date_range(start='4/1/2012', periods=5, freq="M"))
panel_data

We have different ways to access entries. We could specify a `datetime` object.

In [None]:
panel_data[dt(2012, 5, 31)]

We could specify a string.

In [None]:
panel_data['2012-06-30']

If we want entries for a range of dates, then we could specify less information.

In [None]:
panel_data['2012']

Or we could specify a slice of dates like a slice of numbers.

In [None]:
panel_data['2012-04-30':'2012-06-15']
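
Unlike a slice of positions, a slice of dates includes both endpoints. The sketch below illustrates this on a hypothetical series called `toy`, which is not part of the assignment data, using `.loc` for the label-based lookups.

```python
import pandas as pd

# Hypothetical month-end series used only for illustration.
index = pd.to_datetime(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'])
toy = pd.Series(data=range(4), index=index)

one_month = toy.loc['2021-02']                    # every entry in February 2021
window = toy.loc['2021-01-31':'2021-03-15']       # endpoints need not be in the index
inclusive = toy.loc['2021-01-31':'2021-03-31']    # the entry at the right endpoint is kept

print(one_month.values, window.values, inclusive.values)
```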

Access the entries from `panel_data` in June, July and August.

In [None]:
q2_2 = panel_data['2012-06':'2012-08']

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert np.all(q2_2.values == np.array([2, 3, 4]))


### 3. Time Zones

We will use the `pytz` package to deal with time zones. The package recognizes strings for time zones.

In [None]:
pytz.common_timezones[-5:]

By default a `DatetimeIndex` does not have a time zone. We can specify the time zone with the argument `tz`. 

In [None]:
index_dates = pd.date_range(start='4/1/2012', periods=5, freq="M", tz = "US/Mountain")

print(index_dates.tz)

We can add time zones to an existing `DatetimeIndex` using the `tz_localize` function.

In [None]:
index_dates = pd.date_range(start='4/1/2012', periods=5, freq="M")
index_dates_with_timezone = index_dates.tz_localize('US/Mountain')

print(index_dates_with_timezone.tz)

If we have included the time zone, then we can convert using the `tz_convert` function. 

In [None]:
index_dates_with_timezone = index_dates_with_timezone.tz_convert('US/Pacific')

print(index_dates_with_timezone.tz)
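
The two functions do different jobs. `tz_localize` attaches a time zone and leaves the clock reading unchanged, while `tz_convert` keeps the same instant and changes the clock reading. A minimal sketch with a single hypothetical time-stamp:

```python
import pandas as pd

naive = pd.Timestamp('2012-04-30 09:00')        # no time zone attached
mountain = naive.tz_localize('US/Mountain')     # still reads 09:00, now in Mountain time
pacific = mountain.tz_convert('US/Pacific')     # same instant, clock reads one hour earlier

print(naive, mountain, pacific, sep='\n')
```

The localized and converted time-stamps compare as equal because they describe the same moment.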

#### Question 3.1

**True/False** Is `US/Eastern` the time zone in `pytz` for New York City?

In [None]:
q3_1 = True

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert q3_1 in [True, False]


### 4. Stock Data

Having learned about different approaches for working with dates and times, we can process the data about stocks.

#### Question 4.1

Use the `pandas` function `read_csv` to load the data from `raw_data.csv` into a table called `df_raw`.

In [None]:
df_raw = pd.read_csv("raw_data.csv")

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert df_raw.shape == (23, 3)


#### Question 4.2

The column ```times_of_trade``` contains strings representing dates and times. The format is ```dd-mm-yyyy hh:mm:ss```. 

Create another table called ```df``` from ```df_raw```:

$1.$ Use the `pandas` function `rename` to change the name ```times_of_trade``` to ```Time``` 

$2.$ Use the `pandas` function ```to_datetime``` to convert each string in the ```Time``` column to a `datetime` object.
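
The two steps above can be sketched on a tiny hypothetical table. The column names follow the assignment, but the values in `toy_raw` are made up and do not come from `raw_data.csv`.

```python
import pandas as pd

# Hypothetical records in the dd-mm-yyyy hh:mm:ss layout.
toy_raw = pd.DataFrame({
    'Volume': [100, 250],
    'Price': [10.5, 10.7],
    'times_of_trade': ['07-01-2011 10:25:55', '20-01-2011 16:05:00'],
})

# rename returns a new table, so toy_raw keeps its original column name.
toy = toy_raw.rename(columns={'times_of_trade': 'Time'})

# Spell out the layout so that the day is not mistaken for the month.
toy['Time'] = pd.to_datetime(toy['Time'], format='%d-%m-%Y %H:%M:%S')

print(toy.dtypes)
```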


In [None]:
df = df_raw.rename(columns={"times_of_trade": "Time"})
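# %H:%M:%S spells out the stated hh:mm:ss layout instead of relying on the locale-dependent %X
df['Time'] = pd.to_datetime(df['Time'], format = "%d-%m-%Y %H:%M:%S")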
#type(df['Time'][0])
#df['Time'][0]
#raise NotImplementedError()

In [None]:
# TEST 

assert np.all(df.columns == np.array(["Volume", "Price", "Time"]))


#### Question 4.3

We can use the `pandas` function `set_index` to make the `Time` column the index of the table.

In [None]:
df = df.set_index("Time")

The time zone should be `US/Pacific`. However, we did not specify `US/Pacific` in Question 4.2.

Modify the index of `df`: 

$1.$ Using the `pandas` function `tz_localize`, add the timezone `US/Pacific`.

$2.$ Using the `pandas` function `tz_convert`, convert the timezone to `US/Eastern`.

In [None]:
df.index = df.index.tz_localize('US/Pacific')
df.index = df.index.tz_convert('US/Eastern')

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# TEST 

assert str(df.index.tz) == "US/Eastern"
