# Time series

* One of the great powers of *Pandas* is handling time series. 
* A "time series" is a data structure whose index is time. 
* The *Pandas* abstraction `Series` generalizes `numpy.ndarray` by attaching an index of labels, such as times, to the data. 
* Time series are important in all kinds of forecasting, both as input and output to forecasting algorithms. 

# Special treatment of time series in *Pandas*
* Enhanced parsing of dates and times from CSV files. 
* Ability to use time as an index. 

# Times, dates, and datetimes
* One thing to get used to is how Python conceives of time. 
* The basic structure of time measurement is a "datetime". 
* This is a concatenation of a date and a time. 
* Dates are also meaningful on their own, separately from datetimes. 
* Times of day, independent of dates, are rarely of interest in data analysis. 
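
As a small illustration of the three notions above, the standard library has separate `date`, `time`, and `datetime` types. The values below are made up for the example.

```python
from datetime import date, datetime, time

d = date(2019, 6, 13)          # a calendar date only
t = time(11, 24, 23)           # a time of day only
dt = datetime.combine(d, t)    # a datetime is a date plus a time

print(dt)          # 2019-06-13 11:24:23
print(dt.date())   # 2019-06-13
print(dt.time())   # 11:24:23
```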

# Getting used to datetimes
* Let's get used to datetimes by measuring the current time of day: 

In [1]:
from datetime import datetime
print(datetime.now())

2019-06-13 11:24:23.640783


# Notes on now()
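* Reported in the local timezone of the machine running Python (here, the server), with no timezone attached to the value. 
* At system clock resolution (a `datetime` stores at most microseconds). 
* Roughly the same as reported on other servers. 
* Clocks are synchronized over the network via the "Network Time Protocol" (NTP).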
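
A minimal sketch of the difference between the local, timezone-free value that `now()` returns by default and a timezone-aware one. Passing `timezone.utc` is one common choice, not the only one.

```python
from datetime import datetime, timezone

naive = datetime.now()               # local wall-clock time, no timezone attached
aware = datetime.now(timezone.utc)   # current time, labelled as UTC

print(naive.tzinfo)        # None
print(aware.tzinfo)        # UTC
print(aware.isoformat())   # ends with +00:00
```

Timezone-aware values are easier to compare across servers, because the offset travels with the value.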

# Problems with time
* Many different formats with some ambiguities. 
  
 * `year-month-day` versus 
 * `month/day/year` versus 
 * `day/month/year`. 

* Data leaves out any component that isn't important to the study: 
  
 * `year-month-day hour:minute:second` or 
 * `year-month-day` or
 * `year-month` or just 
 * `year`
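
Both problems can be seen with the standard library alone. This sketch parses one made-up string under two different format assumptions, then parses a string that omits the day and the time.

```python
from datetime import datetime

text = "06-07-2019"
as_month_first = datetime.strptime(text, "%m-%d-%Y")  # read as June 7, 2019
as_day_first = datetime.strptime(text, "%d-%m-%Y")    # read as July 6, 2019
print(as_month_first.date(), as_day_first.date())

# Components missing from the string are filled with defaults.
year_month = datetime.strptime("2019-06", "%Y-%m")
print(year_month)  # 2019-06-01 00:00:00
```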
 
 
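# How *Pandas* deals with datetimes
1. Try to infer a single format for the column from the dates it contains. 
2. When a date is ambiguous, read it month-first by default (`dayfirst=False`). 
3. Let the caller override the guess: `dayfirst=True`, or an explicit `format` in `pd.to_datetime`. 

# Determining which date option to use
* If I see `6-30-2019`, I know it's `month-day-year`, because `day-month-year` and `year-month-day` would be impossible dates! 
* Likewise, if I see `2019-6-30`, I know it's `year-month-day`. 
* There are difficult cases: `6-7-2019` could be `month-day-year` (June 7) or `day-month-year` (July 6).
* *Pandas* does not resolve such cases by weighing evidence from the whole file: in version 2.0 and later, it infers the format from the first non-missing value and applies it to the rest of the column. 
* Thus, when the format is known, state it explicitly rather than relying on inference. 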
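
A small sketch of how the ambiguous case can be pinned down explicitly instead of being left to inference. The strings are made up; `dayfirst` and `format` are existing parameters of `pd.to_datetime`.

```python
import pandas as pd

# Only one reading is possible here: there is no month 30.
unambiguous = pd.to_datetime("6-30-2019")

text = "06-07-2019"
month_first = pd.to_datetime(text)                    # default reading: June 7
day_first = pd.to_datetime(text, dayfirst=True)       # day-first reading: July 6
explicit = pd.to_datetime(text, format="%d-%m-%Y")    # format stated outright

print(unambiguous.date(), month_first.date(), day_first.date(), explicit.date())
```

Stating the format is the most robust choice when the convention of the data source is known.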
 
# Here's an example

In [2]:
%more date1.csv

In [3]:
import pandas as pd
date1 = pd.read_csv('date1.csv', header=0, parse_dates=[0], index_col=0)
date1

| date | event | person |
|------|-------|--------|
| 2018-06-20 20:30:40 | first version | alva |
| 2019-06-21 10:40:00 | second version | george |
| 2019-06-22 08:40:00 | first release | alva |


## Let's take this apart: 
* `header=0`: read headings from first row. 
* `parse_dates=[0]`: first column is a datetime. 
* `index_col=0`: use first column as index.
* (It is typical that the index of a time series would be a datetime.)

## Things to note
* The dates are displayed in one canonical format, `YYYY-MM-DD HH:MM:SS`, regardless of how they were written in the file. 
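
The same call can be tried without the file. In this sketch the CSV text is a hypothetical stand-in for `date1.csv` (whose contents are not shown here), read from memory through `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical stand-in for date1.csv; the real file may be formatted differently.
csv_text = """date,event,person
2019-06-20 20:30:40,first version,alva
2019-06-21 10:40:00,second version,george
2019-06-22 08:40:00,first release,alva
"""

df = pd.read_csv(io.StringIO(csv_text), header=0, parse_dates=[0], index_col=0)

print(type(df.index).__name__)  # DatetimeIndex
print(df.index.name)            # date
print(list(df.columns))         # ['event', 'person']
```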

# Querying using datetimes
* A datetime index can be queried with ordinary comparisons.
* Consider:

In [4]:
date1[date1.index > pd.to_datetime('2019-06-22')]

| date | event | person |
|------|-------|--------|
| 2019-06-22 08:40:00 | first release | alva |


## Let's take this apart
* `pd.to_datetime(string)`: converts a string to a datetime. 
* `date1.index > pd.to_datetime(string)`: the datetime is the index, not a regular column, so we refer to it as `date1.index` rather than `date1.date`. 
   
## Things to note
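* `2019-06-22` is interpreted as `2019-06-22 00:00:00`, i.e., midnight at the start of that day, so the row stamped `08:40:00` on the same day satisfies the `>` comparison. 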
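
A sketch of the midnight rule and of two related ways to select rows by date. The small frame is made up for the example; selecting with a date string through `.loc` relies on the index being a `DatetimeIndex`.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2019-06-20 20:30:40", "2019-06-21 10:40:00", "2019-06-22 08:40:00"]
)
events = pd.DataFrame(
    {"event": ["first version", "second version", "first release"]}, index=idx
)

# Comparison against midnight: the 08:40 row on the same day is included.
after_midnight = events[events.index > pd.Timestamp("2019-06-22")]

# A date string in .loc selects every row that falls on that day.
on_that_day = events.loc["2019-06-22"]

# A slice of date strings includes the whole of the end day.
two_days = events.loc["2019-06-21":"2019-06-22"]

print(len(after_midnight), len(on_that_day), len(two_days))  # 1 1 2
```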

# Another example from data.gov
* Dog licenses granted in Allegheny County, PA, City of Pittsburgh

In [1]:
%more 2099.csv

In [6]:
# Dog licenses granted in Allegheny County, PA. 
dogs = pd.read_csv('2099.csv', index_col=6, parse_dates=[6])
dogs.head()

| ValidDate | LicenseType | Breed | Color | DogName | OwnerZip | ExpYear |
|-----------|-------------|-------|-------|---------|----------|---------|
| 2009-05-27 14:44:00 | Dog Lifetime Neutered Male | MIXED | WHITE/BLACK | BUDDY | 15140 | 2099 |
| 2009-06-23 15:31:00 | Dog Lifetime Spayed Female | GER SHORTHAIR POINT | SPOTTED | CABELA | 15236 | 2099 |
| 2011-06-15 13:37:00 | Dog Lifetime Male | BEAGLE MIX | BLACK/BROWN | WATSON | 15106 | 2099 |
| 2018-06-05 12:39:00 | Dog Senior Lifetime Spayed Female | BORD COLLIE MIX | WHITE/BLACK | SADIE | 15227 | 2099 |
| 2008-08-22 12:43:00 | Dog Lifetime Spayed Female | MIXED | BROWN | CHLOE | 15132 | 2099 |




# The problem of alignment
* By default, indexes are just integers. 
* Real-world data is often indexed in time. 
* *Measurements at different times describe different states of the world.* 
* It is necessary to align data with the times at which they occur. 
* This is the purpose of the index of a `Series`. 

# Goals of alignment
* Ensure that each index value refers to one state of the world, e.g., one point in time. 
* Merge data sources that describe different times without ambiguity. 

# Aside: citizen science
* It's often true in geo-informatics that data arise from unlikely sources. 
* They might describe the same or different timestamps. 
* Series provide a way to merge data from various sources. 
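
A minimal sketch of merging two sources that cover overlapping but different dates. The readings and the names `station` and `volunteer` are hypothetical.

```python
import pandas as pd

# Hypothetical readings from two independent sources.
station = pd.Series(
    [20.1, 21.4],
    index=pd.to_datetime(["2019-06-01", "2019-06-02"]),
    name="station",
)
volunteer = pd.Series(
    [21.0, 19.8],
    index=pd.to_datetime(["2019-06-02", "2019-06-03"]),
    name="volunteer",
)

# Side-by-side merge: rows are matched on timestamp, gaps become NaN.
merged = pd.concat([station, volunteer], axis=1)
print(merged)
```

Each row of `merged` describes one date, and a missing reading shows up as `NaN` rather than being silently shifted onto another date.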

# Differences between `Series` and `ndarray`

| ndarray | Series | 
|---------|--------|
| Positions along an axis are always integers | Index labels can be other things, including times or dates |
| Combining arrays requires compatible shapes (in the simplest case, the same length) | Objects of different lengths can be combined; values are aligned on index labels |
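
The second row of the table can be sketched as follows, with made-up values: adding two arrays of different lengths fails, while adding two `Series` aligns them on their dates.

```python
import numpy as np
import pandas as pd

# ndarrays of length 3 and 2 cannot be added element by element.
try:
    np.array([1, 2, 3]) + np.array([10, 20])
except ValueError as err:
    print("ndarray:", err)

a = pd.Series([1, 2, 3], index=pd.to_datetime(["2019-06-01", "2019-06-02", "2019-06-03"]))
b = pd.Series([10, 20], index=pd.to_datetime(["2019-06-02", "2019-06-03"]))

# Series are matched by index label; dates missing from one side give NaN.
total = a + b
print(total)
```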

# `ndarray` is a special case 
* The default index of a `Series` is an integer range (a `RangeIndex`). 
* Thus, *in the default case, a `Series` and a one-dimensional `ndarray` represent the same thing.* 
* Constructor signature: `pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`
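
A short check of the default case, using made-up values: with no `index` argument, a `Series` is labelled 0, 1, 2, ... and holds the same values as the array it was built from.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])
s = pd.Series(values)          # no index given, so a RangeIndex is used

print(s.index)                 # RangeIndex(start=0, stop=3, step=1)
print(s[1], values[1])         # 20 20
print(np.array_equal(s.to_numpy(), values))  # True
```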


In [7]:
import pandas as pd
s = pd.Series(data=[1,2], index=['July 4, 1776', 'November 9, 2018'], name='measurements')
s

July 4, 1776        1
November 9, 2018    2
Name: measurements, dtype: int64

In [8]:
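# Note: the month names are spelled inconsistently ('July', 'Nov', 'Sept').
# pandas 2.0+ infers one format from the first value and may require format='mixed' here.
dates = pd.to_datetime(['July 4, 1776', 'Nov 1, 2018', 'Sept 9, 2019'])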
dates

DatetimeIndex(['1776-07-04', '2018-11-01', '2019-09-09'], dtype='datetime64[ns]', freq=None)

In [9]:
t = pd.Series(data=[1,2,3], index=dates)
t

1776-07-04    1
2018-11-01    2
2019-09-09    3
dtype: int64

In [10]:
t.astype(object)

1776-07-04    1
2018-11-01    2
2019-09-09    3
dtype: object

In [11]:
pd.to_timedelta("3 days")

Timedelta('3 days 00:00:00')

In [12]:
pd.to_timedelta("2h 3m 2s")

Timedelta('0 days 02:03:02')
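
Timedeltas are most useful in arithmetic with timestamps. A brief sketch, reusing two of the dates from the earlier cell:

```python
import pandas as pd

start = pd.Timestamp("2018-11-01")
end = pd.Timestamp("2019-09-09")

elapsed = end - start                        # Timestamp - Timestamp gives a Timedelta
later = start + pd.to_timedelta("3 days")    # Timestamp + Timedelta gives a Timestamp

print(elapsed)  # 312 days 00:00:00
print(later)    # 2018-11-04 00:00:00
print(pd.to_timedelta("2h 3m 2s").total_seconds())  # 7382.0
```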


| Concept | Scalar class | Array class | pandas data type | Primary creation method |
|---------|--------------|-------------|------------------|-------------------------|
| Date times | `Timestamp` | `DatetimeIndex` | `datetime64[ns]` or `datetime64[ns, tz]` | `to_datetime` or `date_range` |
| Time deltas | `Timedelta` | `TimedeltaIndex` | `timedelta64[ns]` | `to_timedelta` or `timedelta_range` |
| Time spans | `Period` | `PeriodIndex` | `period[freq]` | `Period` or `period_range` |
| Date offsets | `DateOffset` | None | None | `DateOffset` |

Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
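
One way to see the four concepts from the table side by side is to create a small instance of each. The start dates and lengths are arbitrary.

```python
import pandas as pd

stamps = pd.date_range("2019-06-01", periods=3, freq="D")        # DatetimeIndex
deltas = pd.timedelta_range(start="1 day", periods=3, freq="D")  # TimedeltaIndex
months = pd.period_range("2019-01", periods=3, freq="M")         # PeriodIndex

# A DateOffset follows calendar rules: one month after Jan 31 is the end of February.
month_later = pd.Timestamp("2019-01-31") + pd.DateOffset(months=1)

print(stamps)
print(deltas)
print(months)
print(month_later)  # 2019-02-28 00:00:00
```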