# 20 Dates and times
File(s) needed: country_timeseries.csv, banklist.csv, FB.csv


Dates can be a problem in any dataset. They are often stored as numeric values counted from a reference date, so interpreting them correctly depends on the standard used by their source. Did you know Excel uses two different start dates (1900 and 1904)? https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel

There are also cultural differences in how dates are represented. Dates in Europe are typically written with the day of the month first, then the month and year, so 1/5/2015 means January 5 in the US but 1 May in much of Europe.

How about time zones? If time and dates are important pieces of your data, you might have to take the time zone into account. There are 4 time zones just across the contiguous 48 United States. Add Alaska, Hawai'i, and US possessions and the number is much larger. What time is it in Guam right now? https://www.timeanddate.com/worldclock/guam

You might need to use UTC (Coordinated Universal Time) or GMT (Greenwich Mean Time) as a basis to standardize any time data you encounter. Or you might need to adjust formatting for an international audience. For any operations involving dates and times, Python provides the `datetime` module with many useful functions built in.
https://docs.python.org/3/library/datetime.html
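
As a rough illustration of standardizing on UTC, the sketch below uses only the standard `datetime` module. The fixed example moment and the variable names are made up for illustration, and Guam is modeled as a fixed UTC+10 offset on the assumption that it does not observe daylight saving time.

```python
import datetime as dt

# Current time as a timezone-aware datetime pinned to UTC
now_utc = dt.datetime.now(dt.timezone.utc)
print(now_utc.isoformat())

# A fixed example moment (hypothetical), so the conversion below is reproducible
moment_utc = dt.datetime(2020, 11, 17, 14, 50, tzinfo=dt.timezone.utc)

# Assumption: Guam is UTC+10 all year, so a fixed offset is enough here
guam = dt.timezone(dt.timedelta(hours=10), name="ChST")
moment_guam = moment_utc.astimezone(guam)
print(moment_guam)  # 2020-11-18 00:50:00+10:00
```

Storing times in UTC and converting only for display keeps comparisons between records from different zones unambiguous.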


In [1]:
# Of course we need to import the libraries
import pandas as pd
import datetime as dt

In [42]:
# Create a datetime object to hold the current date and time
now=dt.datetime.now()
print(now)

2020-11-17 14:50:31.834337


That may not be too helpful since it is difficult to read. The `datetime` object includes a method called `strftime` that allows you to specify a formatting string to control the display of the date and time. A list of the formatting options is available in the documentation: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

In [44]:
# Print a formatted date using strftime
# US style
print(now.strftime("%a, %B, %d, %Y"))
# European style
print(now.strftime("%A, %d %b, %Y"))

Tue, November, 17, 2020
Tuesday, 17 Nov, 2020
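
The same formatting codes work in the other direction with `strptime`, which parses a string into a `datetime`. The short sketch below (the sample string and variable names are only for illustration) shows how one string yields two different dates depending on which convention you assume.

```python
import datetime as dt

raw = "1/5/2015"

# US convention: month/day/year
us = dt.datetime.strptime(raw, "%m/%d/%Y")

# European convention: day/month/year
eu = dt.datetime.strptime(raw, "%d/%m/%Y")

print(us.date())  # 2015-01-05
print(eu.date())  # 2015-05-01
```

Neither result is wrong on its own; which one is right depends on where the data came from.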


# Reading datetime data
We previously read data that included dates, but pandas read them in as strings (the `object` type). Let's load the Ebola data as a reminder, then we'll convert the 'Date' column to a `datetime` type.

In [46]:
# Read the data from the csv file
ebola=pd.read_csv("../MIS-3335/data/country_timeseries.csv")

# Look at the first 5 rows
ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


In [49]:
# What types are the columns?
ebola.dtypes

Date                    object
Day                      int64
Cases_Guinea           float64
Cases_Liberia          float64
Cases_SierraLeone      float64
Cases_Nigeria          float64
Cases_Senegal          float64
Cases_UnitedStates     float64
Cases_Spain            float64
Cases_Mali             float64
Deaths_Guinea          float64
Deaths_Liberia         float64
Deaths_SierraLeone     float64
Deaths_Nigeria         float64
Deaths_Senegal         float64
Deaths_UnitedStates    float64
Deaths_Spain           float64
Deaths_Mali            float64
dtype: object

In [56]:
# Create a new datetime column named date_dt
ebola['date_dt']=pd.to_datetime(ebola['Date'])
#ebola['Date'].astype('datetime64')
ebola

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali,date_dt
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,,2015-01-05
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,,2015-01-04
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,,2015-01-03
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,,2015-01-02
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,,2014-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,3/27/2014,5,103.0,8.0,6.0,,,,,,66.0,6.0,5.0,,,,,,2014-03-27
118,3/26/2014,4,86.0,,,,,,,,62.0,,,,,,,,2014-03-26
119,3/25/2014,3,86.0,,,,,,,,60.0,,,,,,,,2014-03-25
120,3/24/2014,2,86.0,,,,,,,,59.0,,,,,,,,2014-03-24


The format of the date is implicit here: month, day, year. We can also use the `format=` parameter of `pd.to_datetime` to pass a format string, so that the date format is explicitly specified rather than inferred.
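
A minimal sketch of the `format=` parameter, using a few hand-typed strings in the same layout as the Ebola 'Date' column rather than the file itself. The variable names are illustrative, and `errors="coerce"` is used only to show what happens when the assumed layout is wrong.

```python
import pandas as pd

# A few date strings in month/day/year layout
dates = pd.Series(["1/5/2015", "1/4/2015", "12/31/2014"])

# Spell out the layout instead of letting pandas infer it
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
print(parsed)

# Reading the same strings as day/month/year silently changes the first two
# dates and cannot parse the third (there is no month 31), which becomes NaT
day_first = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
print(day_first)
```

An explicit format makes a wrong assumption about the layout much easier to spot.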

We can also convert date data to datetime type inside the `pd.read_csv` function. We do this by adding the `parse_dates=[column_number]` parameter. 

In [58]:
# Read data with parse_dates
ebola=pd.read_csv("../MIS-3335/data/country_timeseries.csv",parse_dates=[0])
ebola.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 122 non-null    datetime64[ns]
 1   Day                  122 non-null    int64         
 2   Cases_Guinea         93 non-null     float64       
 3   Cases_Liberia        83 non-null     float64       
 4   Cases_SierraLeone    87 non-null     float64       
 5   Cases_Nigeria        38 non-null     float64       
 6   Cases_Senegal        25 non-null     float64       
 7   Cases_UnitedStates   18 non-null     float64       
 8   Cases_Spain          16 non-null     float64       
 9   Cases_Mali           12 non-null     float64       
 10  Deaths_Guinea        92 non-null     float64       
 11  Deaths_Liberia       81 non-null     float64       
 12  Deaths_SierraLeone   87 non-null     float64       
 13  Deaths_Nigeria       38 non-null   

# Extracting date components
We can extract the different parts of the date, like month or day, from a datetime column through its `.dt` accessor.

Let's create a new column named "year" from the "Date" column data.

In [61]:
# Create year column
ebola['year']=ebola['Date'].dt.year
ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali,year
0,2015-01-05,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,,2015
1,2015-01-04,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,,2015
2,2015-01-03,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,,2015
3,2015-01-02,286,,8157.0,,,,,,,,3496.0,,,,,,,2015
4,2014-12-31,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,,2014


In [69]:
# We can also extract the month and day data to separate columns.
ebola['month'],ebola['day']=ebola['Date'].dt.month,ebola['Date'].dt.day

# just look at the head() for these date related columns 
ebola[['Date','year','month','day']].head()

Unnamed: 0,Date,year,month,day
0,2015-01-05,2015,1,5
1,2015-01-04,2015,1,4
2,2015-01-03,2015,1,3
3,2015-01-02,2015,1,2
4,2014-12-31,2014,12,31


In [70]:
# What data types are the new columns? 
ebola.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 122 non-null    datetime64[ns]
 1   Day                  122 non-null    int64         
 2   Cases_Guinea         93 non-null     float64       
 3   Cases_Liberia        83 non-null     float64       
 4   Cases_SierraLeone    87 non-null     float64       
 5   Cases_Nigeria        38 non-null     float64       
 6   Cases_Senegal        25 non-null     float64       
 7   Cases_UnitedStates   18 non-null     float64       
 8   Cases_Spain          16 non-null     float64       
 9   Cases_Mali           12 non-null     float64       
 10  Deaths_Guinea        92 non-null     float64       
 11  Deaths_Liberia       81 non-null     float64       
 12  Deaths_SierraLeone   87 non-null     float64       
 13  Deaths_Nigeria       38 non-null   

# Date calculations and timedeltas
A big reason for saving date and time data in the `datetime` data structure is that we can then easily do calculations with them. An original column in the Ebola data is 'Day', which represents the number of days since the beginning of the Ebola outbreak. We can use some basic date math to recreate those values as an example.

The beginning of the outbreak will be the first date (i.e., the minimum date). We can subtract that value from each date to get the number of days. Whenever we subtract one `datetime` from another, we get a `timedelta` object as a result.

In [71]:
# a reminder of the data in ebola, starting at the bottom
print(ebola[['Date','Day']].tail())

# What is the first date?
print(ebola['Date'].min())

          Date  Day
117 2014-03-27    5
118 2014-03-26    4
119 2014-03-25    3
120 2014-03-24    2
121 2014-03-22    0
2014-03-22 00:00:00


In [73]:
# use the first date in calculating a new column of date differences
ebola['outbreak_d']=ebola['Date']-ebola['Date'].min()
# What does the result look like?
ebola

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,...,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali,year,month,day,outbreak_d
0,2015-01-05,289,2776.0,,10030.0,,,,,,...,2977.0,,,,,,2015,1,5,289 days
1,2015-01-04,288,2775.0,,9780.0,,,,,,...,2943.0,,,,,,2015,1,4,288 days
2,2015-01-03,287,2769.0,8166.0,9722.0,,,,,,...,2915.0,,,,,,2015,1,3,287 days
3,2015-01-02,286,,8157.0,,,,,,,...,,,,,,,2015,1,2,286 days
4,2014-12-31,284,2730.0,8115.0,9633.0,,,,,,...,2827.0,,,,,,2014,12,31,284 days
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,2014-03-27,5,103.0,8.0,6.0,,,,,,...,5.0,,,,,,2014,3,27,5 days
118,2014-03-26,4,86.0,,,,,,,,...,,,,,,,2014,3,26,4 days
119,2014-03-25,3,86.0,,,,,,,,...,,,,,,,2014,3,25,3 days
120,2014-03-24,2,86.0,,,,,,,,...,,,,,,,2014,3,24,2 days


In [74]:
# What type is the new column?
ebola.dtypes

Date                    datetime64[ns]
Day                              int64
Cases_Guinea                   float64
Cases_Liberia                  float64
Cases_SierraLeone              float64
Cases_Nigeria                  float64
Cases_Senegal                  float64
Cases_UnitedStates             float64
Cases_Spain                    float64
Cases_Mali                     float64
Deaths_Guinea                  float64
Deaths_Liberia                 float64
Deaths_SierraLeone             float64
Deaths_Nigeria                 float64
Deaths_Senegal                 float64
Deaths_UnitedStates            float64
Deaths_Spain                   float64
Deaths_Mali                    float64
year                             int64
month                            int64
day                              int64
outbreak_d             timedelta64[ns]
dtype: object
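
The `timedelta` values display as "289 days" rather than as plain numbers. If you need integers comparable to the original 'Day' column, one option is the `.dt.days` attribute. The sketch below uses three dates taken from the output above; the variable names are illustrative.

```python
import pandas as pd

# Three dates from the Ebola data, including the first date of the outbreak
dates = pd.to_datetime(pd.Series(["2014-03-27", "2014-03-24", "2014-03-22"]))

# Subtracting the earliest date gives timedelta values
elapsed = dates - dates.min()
print(elapsed.dtype)

# .dt.days pulls out the whole number of days as integers
days = elapsed.dt.days
print(days.tolist())  # [5, 2, 0]
```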

# Datetime methods

In [77]:
# load the bank data
banks=pd.read_csv("../MIS-3335/data/banklist.csv")
banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Bank Name              553 non-null    object
 1   City                   553 non-null    object
 2   ST                     553 non-null    object
 3   CERT                   553 non-null    int64 
 4   Acquiring Institution  553 non-null    object
 5   Closing Date           553 non-null    object
 6   Updated Date           553 non-null    object
dtypes: int64(1), object(6)
memory usage: 30.4+ KB


In [78]:
# load again, but this time with the dates parsed
banks=pd.read_csv("../MIS-3335/data/banklist.csv",parse_dates=[5,6])
banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Bank Name              553 non-null    object        
 1   City                   553 non-null    object        
 2   ST                     553 non-null    object        
 3   CERT                   553 non-null    int64         
 4   Acquiring Institution  553 non-null    object        
 5   Closing Date           553 non-null    datetime64[ns]
 6   Updated Date           553 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 30.4+ KB


In [79]:
# We can create new columns for the year and quarter each bank closed using the .dt accessor
banks['closing_qrt']=banks["Closing Date"].dt.quarter
banks['closing_yr']=banks["Closing Date"].dt.year
banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Bank Name              553 non-null    object        
 1   City                   553 non-null    object        
 2   ST                     553 non-null    object        
 3   CERT                   553 non-null    int64         
 4   Acquiring Institution  553 non-null    object        
 5   Closing Date           553 non-null    datetime64[ns]
 6   Updated Date           553 non-null    datetime64[ns]
 7   closing_qrt            553 non-null    int64         
 8   closing_yr             553 non-null    int64         
dtypes: datetime64[ns](2), int64(3), object(4)
memory usage: 39.0+ KB


In [80]:
# Calculate how many banks closed in each year
banks.groupby(['closing_yr']).size()

closing_yr
2000      2
2001      4
2002     11
2003      3
2004      4
2007      3
2008     25
2009    140
2010    157
2011     92
2012     51
2013     24
2014     18
2015      8
2016      5
2017      6
dtype: int64

In [87]:
# Calculate how many banks closed in each quarter of each year
banks.groupby(['closing_yr','closing_qrt']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
closing_yr,closing_qrt,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2000,4,2,2,2,2,2,2,2
2001,1,1,1,1,1,1,1,1
2001,2,1,1,1,1,1,1,1
2001,3,2,2,2,2,2,2,2
2002,1,6,6,6,6,6,6,6
2002,2,2,2,2,2,2,2,2
2002,3,1,1,1,1,1,1,1
2002,4,2,2,2,2,2,2,2
2003,1,1,1,1,1,1,1,1
2003,2,1,1,1,1,1,1,1


# Subsetting based on dates
Since we know how to get the parts of the date values from a column, we can use that to subset the data with a boolean condition. In this example, we'll use a compound boolean statement to combine two conditions.
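
One way to write such a compound condition is sketched here on a small made-up data frame rather than the real bank list. The bank names and dates are hypothetical; only the 'Closing Date' column name comes from the data above.

```python
import pandas as pd

# Hypothetical stand-in for the banks data
closings = pd.DataFrame({
    "Bank Name": ["Bank A", "Bank B", "Bank C", "Bank D"],
    "Closing Date": pd.to_datetime(
        ["2010-04-16", "2010-04-30", "2010-05-07", "2011-04-08"]
    ),
})

# Build each condition separately, then combine them with &
in_2010 = closings["Closing Date"].dt.year == 2010
in_april = closings["Closing Date"].dt.month == 4

april_2010 = closings[in_2010 & in_april]
print(april_2010)
```

If the two conditions are written inline instead, each one needs its own parentheses, because `&` binds more tightly than `==`.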

In [None]:
# Only get the data for bank closings that occurred in April of 2010


# Resampling
We can do three types of resampling:
1. Downsampling: convert from higher frequency data to lower frequency, like daily to monthly.
2. Upsampling: convert from lower frequency data to higher frequency, like monthly to daily.
3. No frequency change: shift the time period but don't change the frequency, like changing from every Monday to every Thursday.

You are most likely to need to downsample. For example, if you have daily data, you might need to use monthly or quarterly data at times. To do that, we will need to aggregate or subset the data, depending upon the data and the context.

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

The field containing the date to be resampled is normally set as the index for the data frame (alternatively, a date column can be named with the `on=` parameter). We then use the data frame method `resample` and a frequency alias to specify the frequency of the modified data. The frequency codes we can use are described here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects and summarized on page 229 of the book.
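
A minimal downsampling sketch on made-up daily values, not the stock data. The 'Close' column name and the numbers are placeholders. The alias `"MS"` (month start) is used because, to my knowledge, the month-end alias is spelled `"M"` in older pandas releases and `"ME"` in newer ones.

```python
import pandas as pd

# Made-up daily data for the first quarter of 2020, indexed by date
days = pd.date_range("2020-01-01", "2020-03-31", freq="D")
prices = pd.DataFrame({"Close": [float(i) for i in range(len(days))]}, index=days)

# Downsample from daily to monthly, aggregating each month with the mean
monthly = prices.resample("MS").mean()
print(monthly)
```

Each row of the result is labeled with the first day of its month and holds the average of that month's daily values.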

Let's look at Facebook stock price data as an example. 

In [None]:
# Load the Facebook stock data with the date set as the index value


In [None]:
# What does the data look like?


In [None]:
# Downsample to get monthly data based on mean values


In [None]:
# What is the average closing price for each month?


# Time zones
`pytz` is not part of pandas; it is a separate third-party library that pandas has traditionally relied on for time zone support. Use it if you need to work with time zones. Documentation is at http://pytz.sourceforge.net/
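
For many tasks you can stay within pandas itself. The sketch below, with made-up timestamps and illustrative variable names, attaches a time zone to naive values with `tz_localize` and then converts them with `tz_convert`. It assumes the time zone database is available in your environment.

```python
import pandas as pd

# Naive timestamps (no time zone attached yet)
stamps = pd.Series(pd.to_datetime(["2020-11-17 14:50:31", "2020-07-01 12:00:00"]))

# Declare that these values are in UTC
utc = stamps.dt.tz_localize("UTC")

# Convert to US Eastern time; the offset differs between winter and summer
eastern = utc.dt.tz_convert("America/New_York")
print(eastern)
```

The November value shifts by five hours and the July value by four, because daylight saving time is applied automatically.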