In [2]:
import pandas as pd
import numpy as np

# Chapter 2. Finding and Wrangling Time Series Data

## Where to Find Time Series Data


### Prepared Data Sets

The UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php

The UEA and UCR Time Series Classification Repository: http://www.timeseriesclassification.com/

Government time series data sets: 

https://www.ncdc.noaa.gov/cdo-web/datasets

https://www.bls.gov/

https://fred.stlouisfed.org/

https://www.cdc.gov/flu/weekly/fluviewinteractive.htm

Additional:

https://www.comp-engine.org/

https://cran.r-project.org/web/packages/Mcomp/index.html

https://github.com/carlanetto/M4comp2018

### A Worked Example: Assembling a Time Series Data Collection

In [9]:
emails = pd.read_csv('BookRepo-master/Ch02/data/emails.csv')
YearJoined = pd.read_csv('BookRepo-master/Ch02/data/year_joined.csv')
donations =  pd.read_csv('BookRepo-master/Ch02/data/donations.csv')

In [15]:
YearJoined.head(1)

Unnamed: 0,user,userStats,yearJoined
0,0,silver,2014


In [16]:
## python
YearJoined.groupby('user').count().groupby('userStats').count()

Unnamed: 0_level_0,yearJoined
userStats,Unnamed: 1_level_1
1,1000


In [18]:
emails.head(1)

Unnamed: 0,emailsOpened,user,week
0,3.0,1.0,2015-06-29 00:00:00


In [20]:
## python
emails[emails.emailsOpened < 1]

Unnamed: 0,emailsOpened,user,week


In [21]:
emails[emails.user == 998]

Unnamed: 0,emailsOpened,user,week
25464,1.0,998.0,2017-12-04 00:00:00
25465,3.0,998.0,2017-12-11 00:00:00
25466,3.0,998.0,2017-12-18 00:00:00
25467,3.0,998.0,2018-01-01 00:00:00
25468,3.0,998.0,2018-01-08 00:00:00
25469,2.0,998.0,2018-01-15 00:00:00
25470,3.0,998.0,2018-01-22 00:00:00
25471,2.0,998.0,2018-01-29 00:00:00
25472,3.0,998.0,2018-02-05 00:00:00
25473,3.0,998.0,2018-02-12 00:00:00


In [30]:
(max(emails[emails.user == 998].week))

'2018-05-28 00:00:00'

In [35]:
min(emails[emails.user == 998].week)

'2017-12-04 00:00:00'

In [None]:
user_weeks = pd.to_datetime(emails[emails.user == 998].week)  # the column is 'user', and weeks are stored as strings
(user_weeks.max() - user_weeks.min()).days / 7

In [36]:
emails[emails.user == 998].shape

(24, 3)

In [37]:
complete_idx = pd.MultiIndex.from_product((set(emails.week),
set(emails.user)))

In [None]:
all_email = (emails.set_index(['week', 'user'])
             .reindex(complete_idx, fill_value=0)
             .reset_index())


In [None]:
all_email.columns = ['week', 'member', 'emailsOpened']
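
The reindexing step can be checked on a small frame before running it on the full data. The sketch below uses made-up values (the `toy` frame is hypothetical, not part of the book's data) and shows that every missing (week, user) combination gets an explicit row with zero emails opened.

```python
import pandas as pd

# Toy data (hypothetical values): user 2 has no row for the second week.
toy = pd.DataFrame({
    "week": ["2018-01-01", "2018-01-08", "2018-01-01"],
    "user": [1, 1, 2],
    "emailsOpened": [3, 2, 1],
})

# Every (week, user) combination, whether or not it was observed.
full_idx = pd.MultiIndex.from_product(
    [sorted(set(toy.week)), sorted(set(toy.user))],
    names=["week", "user"],
)

# Reindexing inserts the missing combinations and fills them with 0.
filled = (toy.set_index(["week", "user"])
             .reindex(full_idx, fill_value=0)
             .reset_index())
print(filled)
```

Naming the index levels up front means `reset_index()` yields readable column names without a separate renaming step.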


In [None]:
## python
cutoff_dates = (emails.groupby('user').week
                .agg(['min', 'max'])
                .reset_index()
                .rename(columns={'user': 'member'}))


In [None]:
## python
for _, row in cutoff_dates.iterrows():
    member = row['member']
    start_date = row['min']
    end_date = row['max']
    # drop this member's rows before their first week or after their last week
    outside = ((all_email.member == member) &
               ((all_email.week < start_date) | (all_email.week > end_date)))
    all_email.drop(all_email[outside].index, inplace=True)

In [None]:
## python
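donations.timestamp = pd.to_datetime(donations.timestamp)
donations.set_index('timestamp', inplace=True)
# assumes donations identifies donors in a 'member' column; rename first if the CSV uses 'user'
agg_don = donations.groupby('member').apply(
    lambda df: df.amount.resample("W-MON").sum().dropna())
# flatten the (member, timestamp) index into columns for the merge below
agg_donations = agg_don.reset_index()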

In [None]:
## python
merged_frames = []
for member, member_email in all_email.groupby('member'):
    # this member's weekly donations, indexed by week
    member_donations = (agg_donations[agg_donations.member == member]
                        .set_index('timestamp'))
    # parse week strings so both indexes are datetimes, then sort and index by week
    member_email = (member_email.assign(week=pd.to_datetime(member_email.week))
                    .sort_values('week').set_index('week'))
    # left join keeps every email week, with or without a donation
    df = pd.merge(member_email, member_donations, how='left',
                  left_index=True, right_index=True)
    df = df.fillna(0)
    df['member'] = df.member_x
    merged_frames.append(df.reset_index()[['member', 'week', 'emailsOpened', 'amount']])
merged_df = pd.concat(merged_frames, ignore_index=True)

In [None]:
## python
df = merged_df[merged_df.member == 998].copy()  # copy so the new column is not set on a view
df['target'] = df.amount.shift(1)
df = df.fillna(0)
df

In [None]:
## python
df['dt'] = df.time - df.time.shift(-1)

## Cleaning Your Data

The following sections cover these data-cleaning problems:

Missing data

Changing the frequency of a time series (that is, upsampling and downsampling)

Smoothing data

Addressing seasonality in data

Preventing unintentional lookaheads

### Handling Missing Data

A few specific ways to fill in numbers for missing values:

Forward fill

Moving average

Interpolation
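
A minimal sketch of the three approaches on a made-up monthly series (the values and the window size of 3 are assumptions chosen for illustration). The moving average here is restricted to past observations; linear interpolation uses the point after the gap as well.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with two gaps.
s = pd.Series(
    [10.0, np.nan, 14.0, 16.0, np.nan, 20.0],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# Forward fill: carry the last observed value forward.
ffilled = s.ffill()

# Moving average: fill each gap with the mean of the observations in a
# trailing window that ends one step earlier (past values only).
trailing_mean = s.rolling(window=3, min_periods=1).mean().shift(1)
ma_filled = s.fillna(trailing_mean)

# Linear interpolation: uses the points on both sides of the gap,
# so it relies on a value that comes later in time.
interpolated = s.interpolate(method="linear")

print(pd.DataFrame({"ffill": ffilled, "moving_avg": ma_filled,
                    "interpolated": interpolated}))
```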

### Lookahead

The term lookahead is used in time series analysis to denote any knowledge of the future. You shouldn’t have such knowledge when designing, training, or evaluating a model. A lookahead is a way, through data, to find out something about the future earlier than you ought to know it.
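
One way to test a preprocessing step for lookahead is to change only the future and see whether the value at the current time moves. The helper below, `uses_future`, is a hypothetical name introduced for this sketch; it compares a trailing rolling mean with a centered one.

```python
import numpy as np
import pandas as pd

def uses_future(transform, series, t):
    """Return True if transform(series)[t] changes when only values after t change."""
    perturbed = series.copy()
    perturbed.iloc[t + 1:] = perturbed.iloc[t + 1:] + 1000.0
    before = transform(series).iloc[t]
    after = transform(perturbed).iloc[t]
    return not np.isclose(before, after)

s = pd.Series(np.arange(10, dtype=float))

def trailing(x):
    # window covers t-2, t-1, t
    return x.rolling(3).mean()

def centered(x):
    # window covers t-1, t, t+1
    return x.rolling(3, center=True).mean()

print(uses_future(trailing, s, 5))  # trailing window: past and present only
print(uses_future(centered, s, 5))  # centered window: includes the next point
```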

### Downsampling:

Downsampling is as simple as selecting out every nth element.
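
A short sketch on a made-up hourly series: slicing keeps every nth observation, while `resample` aggregates each period instead of discarding points. The choice of n = 12 and of the daily mean is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering two days.
hourly = pd.Series(
    np.arange(48, dtype=float),
    index=pd.date_range("2020-01-01", periods=48, freq="h"),
)

# Keep every 12th observation.
every_12th = hourly.iloc[::12]

# Alternative: aggregate each day rather than discarding points.
daily_mean = hourly.resample("D").mean()

print(every_12th)
print(daily_mean)
```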

### Upsampling:

Common reasons to upsample:

Irregular time series

Inputs sampled at different frequencies

Knowledge of time series dynamics
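
As a sketch of the "inputs sampled at different frequencies" case, the example below aligns a made-up monthly series with a daily grid. Forward fill is one possible choice; it only uses values that were already known on each day.

```python
import pandas as pd

# Hypothetical monthly reading that must be aligned with daily data.
monthly = pd.Series(
    [100.0, 110.0, 120.0],
    index=pd.date_range("2020-01-01", periods=3, freq="MS"),
)

# Upsample to daily frequency. Forward fill assigns each new day the
# most recent monthly value that was available on that day.
daily = monthly.resample("D").ffill()

print(daily.loc[pd.Timestamp("2020-01-31")])
print(daily.loc[pd.Timestamp("2020-02-15")])
```

Upsampling this way adds rows, not information: the daily series is only as precise as the monthly one it came from.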

### Smoothing Data

Exponential smoothing

In [3]:
import pandas as pd

In [18]:
Columns = ['Date','Passengers']
air = pd.read_csv('BookRepo-master/Ch02/data/AirPassengers.csv', header=None)
air.columns = Columns

In [19]:
air.head()

Unnamed: 0,Date,Passengers
0,1949-01,112
1,1949-02,118
2,1949-03,132
3,1949-04,129
4,1949-05,121


In [20]:
air['Smooth.5'] = pd.ewma(air, alpha = .5).Passengers
air['Smooth.9'] = pd.ewma(air, alpha = .9).Passengers

AttributeError: module 'pandas' has no attribute 'ewma'

In [34]:
air['Smooth.5'] = air.Passengers.ewm(alpha = .5).mean()
air['Smooth.9'] = air.Passengers.ewm(alpha = .9).mean()

In [36]:
air.head(11)

Unnamed: 0,Date,Passengers,Smooth.5,Smooth.9
0,1949-01,112,112.0,112.0
1,1949-02,118,116.0,117.454545
2,1949-03,132,125.142857,130.558559
3,1949-04,129,127.2,129.155716
4,1949-05,121,124.0,121.815498
5,1949-06,135,129.587302,133.681562
6,1949-07,148,138.866142,146.568157
7,1949-08,148,143.45098,147.856816
8,1949-09,136,139.7182,137.185682
9,1949-10,119,129.348974,120.818568
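
The smoothed values in the table can be reproduced by hand. With the default `adjust=True`, pandas computes a weighted mean in which the value i steps back gets weight (1 - alpha)**i. The helper `adjusted_ewm` below is a hypothetical name for this sketch, which checks the first three rows for alpha = 0.5.

```python
import pandas as pd

# First three passenger counts from the table above.
passengers = pd.Series([112.0, 118.0, 132.0])
alpha = 0.5

def adjusted_ewm(values, alpha):
    """Weighted mean with weight (1 - alpha)**i for the value i steps back."""
    result = []
    for t in range(len(values)):
        weights = [(1 - alpha) ** i for i in range(t + 1)]
        window = [values[t - i] for i in range(t + 1)]
        result.append(sum(w * x for w, x in zip(weights, window)) / sum(weights))
    return result

manual = adjusted_ewm(passengers.tolist(), alpha)
from_pandas = passengers.ewm(alpha=alpha).mean().tolist()

print(manual)
print(from_pandas)
```

A larger alpha puts more weight on the most recent value, which is why the `Smooth.9` column tracks the raw counts more closely than `Smooth.5`.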


Kalman and LOESS incorporate data both earlier and later in time, so if you use these methods keep in mind the leak of information backward in time, as well as the fact that they are usually not appropriate for preparing data to be used in forecasting applications.

### Seasonal Data

## Time Zones

In [44]:
import datetime
import pytz


In [39]:
datetime.datetime.utcnow()

datetime.datetime(2019, 11, 27, 14, 41, 59, 738794)

In [40]:
datetime.datetime.now()

datetime.datetime(2019, 11, 27, 11, 42, 11, 368460)

In [42]:
datetime.datetime.now(datetime.timezone.utc)

datetime.datetime(2019, 11, 27, 14, 42, 21, 714051, tzinfo=datetime.timezone.utc)

In [45]:
western = pytz.timezone('US/Pacific')
western.zone

'US/Pacific'

In [46]:
loc_dt = western.localize(datetime.datetime(2018, 5, 15, 12, 34, 0))

In [47]:
loc_dt

datetime.datetime(2018, 5, 15, 12, 34, tzinfo=<DstTzInfo 'US/Pacific' PDT-1 day, 17:00:00 DST>)

In [48]:
pytz.common_timezones

['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek', 'America/Adak', 'America/Anchorage', 'Amer

In [49]:
pytz.country_timezones('tr')

['Europe/Istanbul']
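
Once a datetime has been localized, it can be converted to other zones with `astimezone`. The sketch below reuses the `US/Pacific` timestamp from the cells above and converts it to UTC and to `Europe/Istanbul`; the choice of target zones is only for illustration.

```python
import datetime
import pytz

western = pytz.timezone("US/Pacific")
naive = datetime.datetime(2018, 5, 15, 12, 34, 0)

# localize() attaches the zone and picks the DST offset in force on that date.
loc_dt = western.localize(naive)

# Convert the same instant to UTC and to another zone.
utc_dt = loc_dt.astimezone(pytz.utc)
istanbul_dt = loc_dt.astimezone(pytz.timezone("Europe/Istanbul"))

print(utc_dt.isoformat())
print(istanbul_dt.isoformat())
```

All three objects describe the same instant, so they compare equal even though their wall-clock times differ.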

## Preventing Lookahead

In [51]:
"""
Intentionally introduce a lookahead and see how your model
behaves. Try various degrees of lookahead, so you have an idea how
it shifts accuracy. If you have some idea of the accuracy with
lookahead, you have an idea of what the ceiling on a real model
without unfair knowledge of the future will do. Remember that many
time series problems are extremely difficult, so a model with a
lookahead may seem great until you realize you are dealing with a
high-noise/low-signal data set.

Add features slowly, particularly features you might be processing,
so that you can look for jumps. One sign of a lookahead is when a
particular feature is unexpectedly good, and there isn’t a very good
explanation. At the top of your explanation list should always be
“lookahead.”
"""

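
The advice about deliberately introducing a lookahead can be tried on synthetic data. The sketch below is a rough illustration under stated assumptions: the series is pure white noise, and absolute correlation with the target stands in for model accuracy. The feature names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical high-noise series: pure white noise, so past values
# carry no information about the present one.
y = pd.Series(rng.normal(size=1000))

features = pd.DataFrame({
    # only the previous value: no lookahead
    "honest_lag1": y.shift(1),
    # centered window: includes y[t] itself and y[t+1]
    "leaky_centered": y.rolling(3, center=True).mean(),
})

# Absolute correlation with the target as a crude stand-in for accuracy.
scores = features.corrwith(y).abs()
print(scores)
```

On noise like this, the honest feature scores near zero while the leaky one looks strong, which is the kind of unexplained jump the note above says to watch for.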

# Chapter 3. Exploratory Data Analysis for Time Series

## Familiar Methods

In [52]:
"""
You will want to address the
same exploratory questions you would ask about any new data set, such as:

Are any of the columns strongly correlated with one another?

What is the overall mean of an interesting variable? What is its
variance?

To answer these, you can use familiar techniques such as plotting, taking
summary statistics, applying histograms, and using targeted scatter plots.
"""

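
A minimal sketch of these first-pass checks with pandas, using a made-up frame (the column names and values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical data set with three numeric columns.
df = pd.DataFrame({
    "temperature": [20.0, 22.0, 25.0, 27.0, 30.0],
    "ice_cream_sales": [200.0, 220.0, 260.0, 270.0, 310.0],
    "umbrella_sales": [50.0, 45.0, 30.0, 28.0, 10.0],
})

summary = df.agg(["mean", "var"])   # overall mean and variance per column
correlations = df.corr()            # pairwise Pearson correlations

print(summary)
print(correlations)
```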

In [53]:
"""
What is the range of values you see, and do they vary by time period
or some other logical unit of analysis?

Does the data look consistent and uniformly measured, or does it
suggest changes in either measurement or behavior over time?

To answer these, you can use familiar techniques such as plotting, taking
summary statistics, applying histograms, and using targeted scatter plots.

"""

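
For the time-specific questions, the same summary statistics can be computed per time period. The sketch below groups a made-up daily series by calendar year; the level shift in the second year is an assumption built into the data to show what a change in measurement or behavior might look like.

```python
import numpy as np
import pandas as pd

# Hypothetical daily measurements over two years, with a level
# shift at the start of the second year.
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
values = pd.Series(np.where(idx.year == 2019, 10.0, 50.0), index=idx)
values.iloc[0] = 5.0  # a single low reading in 2019

# Range and mean per calendar year.
by_year = values.groupby(values.index.year).agg(["min", "max", "mean"])
print(by_year)
```

A jump like this between periods is a prompt to ask whether the measurement changed or the underlying behavior did.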