In [None]:
import sys; sys.path.append("..")
from utils import count_down

# Analyzing data with Pandas
or
# Special data types, pivot tables and time series in Pandas

In [None]:
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data Types

## Finding the right data types
Data can be expressed in several levels of measurement. You need to make sure to find the level of measurement that both semantically and computationally makes sense.

A quick detour to scales of measurement
1. **Nominal level** <br/>
   Numbers only represent categories and nothing more. <br/>
   E.g.: genders, colors<br/>
   You can compute: absolute and relative frequencies, mode   
   
1. **Ordinal level** <br/>
   The order has a meaning.<br/>
   E.g.: school grades, music charts, answers on a Likert scale <br/>
   You can additionally compute: cumulative frequencies, median, quantiles   
   
1. **Interval level** <br/>
   Same intervals should convey same meaning.<br/>
   E.g.: temperature in celsius, (intelligence) tests<br/>
   You can additionally compute: mean, standard deviation   

1. **Ratio level**<br/>
   Ratios convey meaning and there is a specific 0 point.<br/>
   E.g.: mass, size, time, speed<br/>
   You can additionally compute: coefficient of variation $c = \frac{s}{\bar X}$, i.e. a normalized standard deviation 
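
To make the coefficient of variation concrete, here is a minimal sketch using NumPy. The `masses` array is made-up illustration data, and using the sample standard deviation (`ddof=1`) for $s$ is an assumption. The point is that the ratio $s / \bar X$ does not change when the unit of a ratio-level variable changes.

```python
import numpy as np

# Hypothetical ratio-level measurements (masses in kg); the values are made up.
masses = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = masses.mean()
std = masses.std(ddof=1)   # sample standard deviation s (assumed definition)
cv = std / mean            # coefficient of variation c = s / mean

# Changing the unit (kg -> g) rescales both s and the mean, but not their ratio.
masses_g = masses * 1000
cv_grams = masses_g.std(ddof=1) / masses_g.mean()

print(cv, cv_grams)
```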


## Categorical data
https://pandas.pydata.org/pandas-docs/stable/categorical.html

Using a categorical dtype has several advantages

* it keeps memory usage low
* it makes data useable for numeric modeling algorithms
* it signals to libraries that build on pandas how to treat the data
* it makes the intent clear, that only certain values are allowed in a column and how they relate to each other

The following `Series` could be perfectly represented using categories instead of strings.

In [None]:
s = pd.Series(['a','b', 'b', 'a', 'c', 'c'])
s

In [None]:
print(f'The string series is {s.nbytes} bytes big.')

By specifying the `dtype` as "category" the data is automatically converted to a categorical scale.

In [None]:
s = pd.Series(['a','b', 'b', 'a', 'c', 'c'], dtype='category')
s

In fact the `Series` gets much smaller already. The effect will be stronger on larger `Series`.

In [None]:
print(f'The categorical series is {s.nbytes} bytes big.')

Categorical data is stored using numeric codes under the hood that map to categories.

In [None]:
s.cat.categories

In [None]:
s.cat.codes
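
As a small sketch of what "codes that map to categories" means, the original values can be rebuilt by looking up each code in the array of categories. The name `s_demo` is only used for this illustration.

```python
import pandas as pd

s_demo = pd.Series(['a', 'b', 'b', 'a', 'c', 'c'], dtype='category')

codes = s_demo.cat.codes.to_numpy()             # one small integer per row
categories = s_demo.cat.categories.to_numpy()   # each distinct label stored once

# Indexing the categories with the codes recovers the original values.
rebuilt = categories[codes]

print(codes)
print(rebuilt)
```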

Using `dtype='category'` will create unordered categories by default.

In [None]:
s.cat.ordered

The `.cat` accessor allows changing, renaming and ordering categories.

In [None]:
s.cat.categories

In [None]:
s.cat.rename_categories(['x', 'y', 'z'])

A categorical series can also be created from `pd.Categorical`. This allows you to set the categories and the ordering explicitly.

In [None]:
pd.Categorical(['a', 'b', 'c', 'a'], categories=['b', 'c'],
                         ordered=False)

The `Categorical` object can then be passed to the `Series` constructor to obtain a real `Series`.

In [None]:
cat_series = pd.Series(
    pd.Categorical(['a', 'b', 'c', 'a'], categories=['b', 'c', 'a'],
                         ordered=False)
)
cat_series

### Ordered Categories
What does it mean to have ordered categories?

In [None]:
cat_series2 = pd.Series(
    pd.Categorical(['c', 'a', 'c', 'b'], categories=['b', 'c', 'a'],
                         ordered=False)
)
cat_series2

In [None]:
cat_series == cat_series2

In [None]:
cat_series > cat_series2

In [None]:
cat_series

In [None]:
cat_series.mode()

In [None]:
cat_series.max()

These semantics are lost when you pull out the atomic values. Only the `Series` is categorical, not the single entries.

In [None]:
cat_series.iloc[0], type(cat_series.iloc[0])

In [None]:
cat_series.iloc[0] < cat_series.iloc[1]

Now the same for an **ordered** categorical `Series`.

In [None]:
cat_ordered_series = pd.Series(
    pd.Categorical(['a', 'b', 'c', 'a'], categories=['b', 'c', 'a', 'd'],
                         ordered=True)
)
cat_ordered_series

In [None]:
cat_ordered_series2 = pd.Series(
    pd.Categorical(['c', 'a', 'c', 'b'], categories=['b', 'c', 'a', 'd'],
                    ordered=True)
)
cat_ordered_series2

In [None]:
cat_ordered_series > cat_ordered_series2

In [None]:
cat_ordered_series.max()

In [None]:
cat_ordered_series == cat_ordered_series2

In [None]:
cat_ordered_series.equals(cat_ordered_series2)

The median still does not work for categorical data, but if you really need it you can compute it from the codes.

In [None]:
cat_ordered_series

In [None]:
cat_ordered_series.cat.codes.median()
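
The median of the codes can fall between two codes, so it is not always a valid category. One possible workaround, sketched here with made-up data in `grades`, is to take the lower median of the codes and map it back to a category. This assumes there are no missing values, since those are stored with the code `-1`.

```python
import pandas as pd

# Hypothetical ordered categorical data for illustration only.
grades = pd.Series(pd.Categorical(
    ['low', 'high', 'medium', 'low', 'medium'],
    categories=['low', 'medium', 'high'], ordered=True))

# interpolation='lower' guarantees that the result is an existing code.
median_code = int(grades.cat.codes.quantile(0.5, interpolation='lower'))

# Map the code back to its category label.
median_category = grades.cat.categories[median_code]
print(median_category)
```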

If you need to cast existing data to a categorical type and want to specify the categories and the ordering, you can use `pd.CategoricalDtype` to create your own categorical datatype. It works the same way as `pd.Categorical` except that you do not pass the data. The newly created datatype can then be used in an `.astype` cast.

In [None]:
series = pd.Series(['a', 'b', 'c', 'a'])
series

In [None]:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['b', 'c', 'a'],
                             ordered=True)
cat_type

In [None]:
series.astype(cat_type)
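
The following sketch shows what the ordering buys you after such a cast. The `sizes` data and the name `size_type` are made up for illustration. Comparisons and sorting follow the order given in `categories`, not the alphabetical order of the strings.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data; every value appears in the category list below.
sizes = pd.Series(['small', 'large', 'medium', 'small'])
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)

sizes_cat = sizes.astype(size_type)

smaller_than_large = sizes_cat < 'large'   # uses the category order
in_order = sizes_cat.sort_values()         # small < medium < large

print(smaller_than_large.tolist())
print(in_order.tolist())
print(sizes_cat.max())
```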

Now let's look at a real-world dataset and some discretization techniques. The Titanic dataset contains features about passengers of the tragic Titanic journey. A common introductory machine learning exercise is to predict survival of the passengers based on the features (see https://www.kaggle.com/c/titanic/data ).

In [None]:
titanic = pd.read_csv('data/titanic.csv')
titanic.head()

In [None]:
titanic.dtypes

We include all columns in the description as `object` columns are described differently than `numeric` columns and are excluded from the description by default.

In [None]:
titanic.describe(include='all')

Let's extend the port of embarkation with the full name to make things a bit more readable. We use a simple merge operation to achieve this.  

In [None]:
embarked_map = pd.DataFrame({'Embarked': ['C', 'Q', 'S'],
                             'EmbarkedLong': ['Cherbourg', 'Queenstown', 'Southampton']})
embarked_map

In [None]:
titanic = titanic.merge(embarked_map).sort_values(by='PassengerId')
titanic.head()

In [None]:
titanic.dtypes

Since the `EmbarkedLong` column has only three distinct values it is natural to represent it using categories.

In [None]:
titanic['EmbarkedLong'].unique()

In [None]:
titanic['EmbarkedLong'] = titanic['EmbarkedLong'].astype('category')
titanic['EmbarkedLong'].head()

In [None]:
titanic.dtypes

The description for a categorical column is the same as for an `object` column.

In [None]:
titanic['EmbarkedLong'].describe()

### Exercise

Convert the Sex-Column to a category-datatype

In [None]:
count_down(1)

In [None]:
titanic['Sex'] = titanic['Sex'].astype('category')
titanic.dtypes

## Discretizing continuous values (Tiling)
Sometimes it makes sense to convert numeric data into categorical data. For example, for some problems the exact age of a person might not matter, only whether the person is a minor or not. This process of conversion is called tiling.

https://pandas.pydata.org/pandas-docs/stable/basics.html#discretization-and-quantiling

In [None]:
titanic['Age'].describe()

Using `cut` we can discretize numeric values.

In [None]:
titanic['Age'].head(7)

In [None]:
pd.cut(titanic['Age'], bins=3).head(7)

By default `cut` splits the range of the data into intervals of equal width. As this seldom makes sense, we can set the bin edges ourselves.

In [None]:
pd.cut(titanic['Age'], bins=[0, 17, 67, 80], include_lowest=True).head(7)

In [None]:
pd.cut(titanic['Age'], bins=[0, 17, 67, 80], include_lowest=True).value_counts()  # include_lowest only affects the first interval

In [None]:
pd.cut(titanic['Age'], bins=[0, 17, 67, 80]).value_counts()

If you set the bin edges manually, be sure to cover the whole range as values not falling into an interval will be set to NA.

In [None]:
pd.cut(titanic['Age'], 
       bins=[64, 66, 67, 80],
       labels=['child', 'grown-up', 'senior']).head(7)
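
A minimal sketch of the edge cases, using a made-up `ages` series instead of the Titanic data: intervals are closed on the right by default, so `17` still counts as a child, `0` only falls into the first bin with `include_lowest=True`, and anything outside the edges becomes NA.

```python
import pandas as pd

# Hypothetical ages chosen to hit the bin edges.
ages = pd.Series([0, 5, 17, 18, 67, 68, 95])
labels = ['child', 'grown-up', 'senior']

# Bins are (0, 17], (17, 67], (67, 80]; 0 and 95 fall outside and become NA.
binned = pd.cut(ages, bins=[0, 17, 67, 80], labels=labels)

# include_lowest=True closes the first interval on the left as well.
with_lowest = pd.cut(ages, bins=[0, 17, 67, 80], labels=labels, include_lowest=True)

print(binned.tolist())
print(with_lowest.tolist())
```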

In [None]:
titanic['Age_coarse'] = pd.cut(titanic['Age'], bins=[0, 17, 67, 80], labels=['child', 'grown-up', 'senior'])

A related function is `qcut`, which cuts at quantiles.

In [None]:
pd.qcut(titanic['Age'], 3).head()

To understand `pd.qcut`, let's look at the distribution of the data using pandas plotting tools.

In [None]:
titanic['Age'].hist(density=True)

Pandas plotting tools generally just call the corresponding `matplotlib` functions, but are more convenient, as they do not require us to remove NA values first. Here is the equivalent plot in `matplotlib`. 

In [None]:
plt.hist(titanic['Age'].dropna().values, density=True)

So `qcut` divides the data into bins such that an equal number of values will fall in each bin. We can also check this visually with Matplotlib (and, more easily, with `seaborn`).

In [None]:
pd.qcut(titanic['Age'], 3).value_counts().values

In [None]:
plt.bar(np.arange(3), height=pd.qcut(titanic['Age'], 3).value_counts().values)

In [None]:
sns.countplot(x=pd.qcut(titanic['Age'], 4))  # pass the data explicitly as x

In [None]:
age = titanic['Age']
age[(0.4 < age) & (age <= 22)].count() / age.count()

In [None]:
age[(22 < age) & (age <= 33)].count() / age.count()

In [None]:
age[(33 < age) & (age <= 80)].count() / age.count()
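
The difference between `cut` and `qcut` is easiest to see on a small skewed sample. The `values` series below is made up for illustration: `cut` produces bins of equal width, `qcut` produces bins with (roughly) equal counts.

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins: almost everything lands in the first bin.
equal_width = pd.cut(values, bins=3).value_counts().sort_index()

# Quantile-based bins: each bin receives the same number of values.
equal_count = pd.qcut(values, q=3).value_counts().sort_index()

print(equal_width)
print(equal_count)
```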

## Aside: Plotting with Pandas

Pandas offers a few plotting functions that are built on the corresponding matplotlib functions and call them internally. To customize the result, you can pass a matplotlib `Axes` object via the `ax` argument: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

In [None]:
titanic['Age'].hist(density=True)

In [None]:
plt.hist(titanic['Age'].dropna().values, density=True);

In [None]:
fig, ax = plt.subplots()
ax.hist(titanic['Age'].dropna().values, density=True)
ax.set_xlabel('Age');

In [None]:
fig, ax = plt.subplots()
titanic['Age'].hist(density=True, ax=ax)
ax.set_xlabel('Age');

In [None]:
titanic['Age_coarse'].value_counts()

In [None]:
titanic['Age_coarse'].value_counts().plot(kind='pie')

In [None]:
titanic['Age_coarse'] = titanic['Age_coarse'].cat.add_categories(['unknown'])

In [None]:
titanic['Age_coarse'].cat.categories

In [None]:
titanic['Age_coarse'] = titanic['Age_coarse'].fillna('unknown')

In [None]:
titanic['Age_coarse'].value_counts()

In [None]:
titanic['Age_coarse'].value_counts().plot(kind='pie')

In [None]:
titanic['Age_coarse'].value_counts().plot(kind='barh')

Note that even categories do not prevent you from making stupid plots!

In [None]:
titanic['Age_coarse'].value_counts().plot(kind='kde')

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

### Exercise
Plot the distribution of males and females in a barplot. Rotate the xticks of the resulting plot by 25° like in the MPL-Homework

In [None]:
count_down(3)

In [None]:
fig, ax = plt.subplots()
(titanic['Sex'].value_counts()/titanic['Sex'].count()).plot(kind='bar', ax=ax)
# rotate the existing tick labels instead of overwriting them with the raw column
plt.setp(ax.get_xticklabels(), rotation=25, ha='right');

## Converting to numeric data
Sometimes numeric data is screwed up somehow. `pd.to_numeric` handles these cases by casting everything to an appropriate numeric type.

In [None]:
!cat data/numeric_data.csv

In [None]:
numeric_data = pd.read_csv('data/numeric_data.csv')
numeric_data

In [None]:
numeric_data.dtypes

In [None]:
numeric_data['C'].sum()

In [None]:
numeric_data['B'].astype('int')

In [None]:
numeric_data['A'].astype('float')

In [None]:
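# errors='ignore' returns the input unchanged if parsing fails (deprecated in newer pandas versions)
pd.to_numeric(numeric_data['A'], errors='ignore')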

In [None]:
pd.to_numeric(numeric_data['A'], errors='coerce')

In [None]:
pd.to_numeric(numeric_data['B'], errors='coerce')

In [None]:
pd.to_numeric(numeric_data['C'], errors='coerce')

`to_numeric` only works on one-dimensional data such as a `Series`, but luckily we can `apply` it to each column!

In [None]:
numeric_data

In [None]:
numeric_data.apply(pd.to_numeric, errors='coerce').dtypes #keyword-arguments are passed to the respective function

In [None]:
isinstance(np.nan, float)
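
Since `data/numeric_data.csv` is not shown here, the following sketch uses made-up string data (`raw` and `frame` are hypothetical) to show the effect of `errors='coerce'`: unparsable entries become `NaN`, and because `NaN` is a float, the affected column ends up as `float64`.

```python
import pandas as pd

# Hypothetical messy column, stored as Python strings.
raw = pd.Series(['1', '2.5', 'n/a', '4'], dtype=object)
clean = pd.to_numeric(raw, errors='coerce')   # 'n/a' cannot be parsed -> NaN

# The same conversion applied column by column to a DataFrame.
frame = pd.DataFrame({'A': ['1', 'x'], 'B': ['3', '4']}, dtype=object)
converted = frame.apply(pd.to_numeric, errors='coerce')

print(clean)
print(converted.dtypes)
```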

### Optional Integer NA Support

Since Pandas 0.24 (January 2019), Pandas can finally hold integer dtypes with missing values:

In [None]:
pd.__version__

Remember how annoying a single missing value is for our DataFrame:

In [None]:
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, np.nan, 3, 2, 3, 1, 1, 2, 1],
    }
pd.DataFrame(exam_data)

Pandas can represent integer data with possibly missing values using `arrays.IntegerArray`. This is an extension type implemented within pandas itself, not borrowed from numpy. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype to `array()` or `Series`:

In [None]:
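# only 'attempts' holds whole numbers, so only that column is cast to the nullable dtype
tmp = pd.DataFrame(exam_data).astype({'attempts': 'Int8'})  # equivalently: pd.Int8Dtype()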
tmp

In [None]:
tmp['attempts'] += 0.01
tmp

Note that this corresponds to a new dtype, the `Nullable Integer Data Type`.

In [None]:
s = pd.Series([1, 2, np.nan], dtype='Int64')
df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
df

In [None]:
df.dtypes

In [None]:
df.loc[1, 'B'] = np.nan
df
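
A minimal sketch of the difference, using a made-up `attempts` list: without an explicit dtype a single missing value turns the whole column into floats, while `'Int64'` keeps integers and stores the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with one missing entry.
attempts = [1, 3, np.nan, 2]

as_float = pd.Series(attempts)                    # inferred as float64
as_nullable = pd.Series(attempts, dtype='Int64')  # nullable integer dtype

print(as_float.dtype, as_nullable.dtype)
print(as_nullable)
```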

# Exploratory data analysis

Exploratory data analysis (EDA) describes the process of building up an intuition for our data. It is achieved through a combination of data transformations and visualizations. Typical steps in the process of EDA are:


1. Research the fields of the dataset 
2. Form hypotheses/develop investigation themes to explore 
3. Wrangle data 
4. Assess quality of data 
5. Profile data 
6. Explore each individual variable in the dataset 
7. Assess the relationship between each variable and the target 
8. Assess interactions between variables 
9. Explore data across many dimensions 

EDA is very important as we cannot judge whether our modeling makes sense if we don't have intuition for our data. While every analysis starts with EDA you will always return to it when you get new results from modeling.

Here we present pivot tables as an easy way to explore the relationships between variables.

## Pivot for analysis 
Last time we introduced pivot tables as a way to restructure untidy data. However, they were originally an operation for creating tabular summaries of data. They can be used as a convenient shortcut for a two-dimensional groupby.

Let's look at a normal groupby first:

In [None]:
titanic.groupby('Sex').mean(numeric_only=True)  # skip text columns such as Name or Ticket



Let's say we want to analyze the influence of gender and passenger class on survival in the Titanic dataset.

In [None]:
titanic.groupby(['Sex', 'Pclass'])['Survived'].mean()

Resetting the index makes it look a little nicer.

In [None]:
titanic.groupby(['Sex', 'Pclass'])['Survived'].mean().reset_index()

For people who are used to the tidy format, this can be read intuitively. However, you might still prefer to have the second variable in the column headers. This is what is called a "pivot table".

In [None]:
titanic.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack()

For doing exactly this, pandas provides a shortcut.

In [None]:
titanic.pivot_table(values='Survived', index='Sex', columns='Pclass')

Pivot tables can also include the margins, i.e. the values aggregated over rows and columns.

In [None]:
titanic.pivot_table(values='Survived', index='Sex', columns='Pclass', margins=True)
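
To check the equivalence on something small enough to verify by hand, here is a sketch with a made-up six-row frame called `toy` (not the real Titanic data). It also shows that the margins are computed from the underlying rows and are not the mean of the cell means.

```python
import pandas as pd

# Hypothetical mini dataset with the same column names as the Titanic data.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 2, 1, 2, 2, 1],
    'Survived': [1, 1, 0, 0, 1, 1],
})

via_groupby = toy.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack()
via_pivot = toy.pivot_table(values='Survived', index='Sex', columns='Pclass')

with_margins = toy.pivot_table(values='Survived', index='Sex',
                               columns='Pclass', margins=True)

# Male cells are 0.0 and 0.5, but the margin is 1 survivor out of 3 men.
male_margin = with_margins.loc['male', 'All']
print(via_pivot)
print(with_margins)
```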

By default `pivot_table` will aggregate using the mean, but we can also choose all the functions available in `groupby` or use our own ones.

In [None]:
titanic.pivot_table(values='Fare', index='Sex', columns='Pclass',
                   aggfunc=['min', 'max'])  # names of the built-in aggregations

We can also aggregate multiple values in the same pivot table.

In [None]:
titanic.pivot_table(index='Sex', columns='Pclass',
                   aggfunc={'Survived': 'sum', 'Fare': 'max'})

Combining more than two variables is equally possible by stacking them in either the rows or the columns.

In [None]:
titanic.pivot_table(values='Fare', index=['Sex', 'EmbarkedLong'], columns='Pclass',
                   aggfunc='mean')

In [None]:
titanic['Age_coarse'] = pd.cut(titanic['Age'], bins=[0, 17, 67, 80], labels=['child', 'grown-up', 'senior'])

The tool [`pivottablejs`](https://github.com/nicolaskruchten/pivottable) allows you to quickly explore data with pivot tables using drag and drop. When using such a graphical tool you should make sure that you turn the interesting things into code so they don't get lost after closing the notebook.

In [None]:
from pivottablejs import pivot_ui
pivot_ui(titanic)

### Exercise

Use a pivot-table to check if [Women and children first](https://en.wikipedia.org/wiki/Women_and_children_first) was taken seriously on the Titanic. Is there a difference in the way the tool and Pandas calculate the margins?

In [None]:
count_down(3)

In [None]:
titanic.pivot_table(values='Survived', index='Age_coarse', columns='Sex', margins=True)

In [None]:
titanic.groupby('Age_coarse')['Survived'].mean()

## Profiling
When doing exploratory data analysis, a lot of tasks have to be done every time, so they can be automated. Tools like `pandas_profiling` can create summaries that give insights into many standard questions you can ask of a dataset. However, with abstraction comes less flexibility, so tools like this will only ever do part of your work and might at times not do what you want at all.

In [None]:
# newer releases of this package are published under the name ydata_profiling
from pandas_profiling import ProfileReport
ProfileReport(titanic)

The following tutorial talks more about tools and processes in exploratory data analysis.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('W5WE9Db2RLU')

# Working with timeseries data
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html  
The most fundamental building block of timeseries data in pandas is the `Timestamp`. It represents a moment in time to the precision of a nanosecond. It is complemented by `Timedelta`, which represents a fixed stretch of time such as "one day" without being tied to any date, and `Period`, which represents a span of time anchored to a date, such as "June 2018". A `Period` needs to have a certain regularity, such as every month.


## Timestamps
Timestamps can be easily created from human-readable strings using `pd.to_datetime`.

In [None]:
pd.to_datetime('2019-06-04')

In [None]:
pd.to_datetime('4th June 19')

In [None]:
pd.to_datetime('04.06.2019')

For non-Americans and everyone else who thinks that the day should come before the month:

In [None]:
pd.to_datetime('04.06.2019', dayfirst=True)
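
A short sketch of the three options side by side. Passing an explicit `format` is an addition here, offered as one way to remove the ambiguity altogether; the variable names are only for illustration.

```python
import pandas as pd

month_first = pd.to_datetime('04.06.2019')                     # read as April 6th
day_first = pd.to_datetime('04.06.2019', dayfirst=True)        # read as June 4th
explicit = pd.to_datetime('04.06.2019', format='%d.%m.%Y')     # no guessing involved

print(month_first, day_first, explicit)
```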

In [None]:
pd.to_datetime('2019-06-04 14:45')

In [None]:
date = pd.to_datetime('2019-06-04 14:45:30.600700800')
date

`Timestamps` make all information available via attributes.

In [None]:
date.year

In [None]:
date.month

In [None]:
date.day

In [None]:
date.second

In [None]:
date.microsecond

In [None]:
date.nanosecond

Timestamps can be compared:

In [None]:
date1 = pd.to_datetime('2019-06-04 14:45')
date2 = pd.to_datetime('2019-06-04 14:46')
date1 < date2

When passed a Series, `to_datetime` returns a Series (with the same index), while a list-like is converted to a DatetimeIndex:

In [None]:
pd.to_datetime(pd.Series(['Jul 31, 2009', 'Jan 10, 2010', None]))  # one consistent format per call

In [None]:
pd.to_datetime(['2005/11/23', '2010/12/31'])  # one consistent format per call

### Timezones

In [None]:
from datetime import datetime
import pytz

In [None]:
time1 = pd.to_datetime('2019-05-21 13:30:38+00:00')
time2 = datetime.now()
time1 > time2

In [None]:
time1 = pd.to_datetime('2019-05-21 13:30:38')
time2 = datetime.now()
time1 > time2

In [None]:
time1 = pd.to_datetime('2019-05-21 13:30:38+00:00')
time1.tz

In [None]:
datetime.now().tzinfo

To check if a timestamp is offset-naive (does not contain a timezone):

In [None]:
time1.tzinfo is None or time1.tzinfo.utcoffset(time1) is None

In [None]:
datetime.now(pytz.timezone('Europe/Berlin')).tzinfo

In [None]:
time1 = pd.to_datetime('2019-05-23 12:20:38+00:00')
time2 = datetime.now(pytz.timezone('Europe/Berlin'))
time1 > time2

In [None]:
time1 = time1.tz_convert('Europe/Berlin')
time1

In [None]:
time1 = pd.to_datetime('2019-05-23 12:20:38', utc=True).tz_convert('Europe/Berlin')
time1

In [None]:
time1 = pd.to_datetime('2019-05-23 12:20:38', utc=True).tz_localize('Europe/Berlin')

In [None]:
time1 = pd.to_datetime('2019-05-23 12:20:38').tz_localize('Europe/Berlin')
time1

In [None]:
time2 = datetime.now(pytz.timezone('Europe/Berlin'))
time1 > time2
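
The cells above mix `tz_localize` and `tz_convert`. As a minimal sketch of the difference: `tz_localize` attaches a timezone to a naive timestamp and keeps the wall-clock time, while `tz_convert` translates an already timezone-aware timestamp into another zone and keeps the moment in time.

```python
import pandas as pd

naive = pd.Timestamp('2019-05-23 12:20:38')

# Interpret the naive wall-clock time as Berlin time: still 12:20.
localized = naive.tz_localize('Europe/Berlin')

# Interpret it as UTC first, then express that moment in Berlin time: 14:20 (UTC+2 in May).
converted = naive.tz_localize('UTC').tz_convert('Europe/Berlin')

print(localized)
print(converted)
print(converted - localized)
```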

In [None]:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=4)

df = pd.DataFrame({'Date': rng, 'a': range(4)})  
df

In [None]:
df.Date = df.Date.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata') #just like there's a .str-accessor for strings, there's a .dt-accessor for datetimes
df

### Datetime Compatibility

For those of you used to thinking in unix-time (https://www.unixtimestamp.com/ ):

In [None]:
pd.to_datetime(datetime.now()).value

In [None]:
import time
time.mktime(pd.to_datetime(datetime.now()).timetuple())

`Timestamps` can be formatted using a special set of symbols. All of them can be found here https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [None]:
date.strftime('Today is %A')

In [None]:
pd.to_datetime('12-11-2010 00:00', format='%d-%m-%Y %H:%M')

### Exercise 

Write a function that receives a datetime and shows a text like `That day is Tuesday, the 21st of May 2019`

In [None]:
from datetime import datetime

SUFFIXES = {1: 'st', 2: 'nd', 3: 'rd'}
def ordinal(num):
    if 10 <= num % 100 <= 20:
        suffix = 'th'
    else:
        suffix = SUFFIXES.get(num % 10, 'th')
    return str(num) + suffix

tell_day(datetime.now())

In [None]:
count_down(3)

In [None]:
def tell_day(dt):
    print(dt.strftime('That day is %A, the'), ordinal(int(dt.strftime('%d'))), dt.strftime('of %B %Y'))

### DatetimeIndex

Timestamps can be used to index data.

In [None]:
index = pd.DatetimeIndex(['2018-06-04', '2018-06-11',
                          '2018-06-18', '2018-06-25',
                          '2018-07-02'])
schedule = pd.Series(['Analyzing Data with Pandas', 'Creating Experiments', 
                      'Statistical Modeling', 'Statistical Visualization',
                  'Interactive Plotting'], index=index)
schedule

In [None]:
schedule['2018-06-10':'2018-06-30']

Just like there's NaN for numbers, there's NaT (Not-A-Time) for timestamps:

In [None]:
dt = pd.to_datetime(['2009/07/31', 'asd'], errors='coerce')
dt

`isnull()` checks for missing dates in DatetimeIndex-objects (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike):

In [None]:
dt.isnull()

## Timedeltas and Periods

Timedeltas can be added to Timestamps:

In [None]:
delta = pd.to_timedelta('1 day')
delta

In [None]:
schedule.index += delta
schedule

In [None]:
from datetime import datetime
datetime.now()

In [None]:
datetime.now() - pd.to_datetime('2018-06-04')

In [None]:
schedule.index += (datetime.now() - pd.to_datetime('2018-06-05'))
schedule.index = schedule.index.date
schedule

The combination of Timestamps and Timedeltas allows for nice arithmetics with dates:

In [None]:
friday = pd.Timestamp('2018-01-05')
saturday = friday + pd.to_timedelta('1 day')
saturday, saturday > friday, saturday - friday

There are even business days in Pandas (Friday --> Monday):

In [None]:
friday = pd.Timestamp('2018-01-05')
monday = friday + pd.offsets.BDay()
monday

### date_range

In [None]:
index = pd.DatetimeIndex(['2018-06-04', '2018-06-11',
                          '2018-06-18', '2018-06-25',
                          '2018-07-02'])
schedule = pd.Series(['Analyzing Data with Pandas', 'Creating Experiments', 
                      'Statistical Modeling', 'Statistical Visualization',
                  'Interactive Plotting'], index=index)
schedule

A more convenient way to create such an index is to use `date_range`.  
`periods` specifies how many entries we want; alternatively we could set an explicit `end`. `freq` specifies how the entries are spaced. The full list of possible offsets can be found here http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases. Thus the syntax is very similar to `range(start, stop, step)`.

In [None]:
index = pd.date_range('2018-06-04', periods=5, freq='W')
index

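Note that `freq='W'` does not mean a plain seven-day step from the start date. It is anchored to the end of the week (Sunday by default), so the first entry is the first Sunday on or after the start date. For a plain weekly step, use `freq='7D'`: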

In [None]:
index = pd.date_range('2018-06-04', periods=5, freq='7D')
index
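
The two frequencies can be compared directly. In this sketch, `'W-MON'` is added as a third variant to show that the anchor day can be chosen; the variable names are only for illustration.

```python
import pandas as pd

# 2018-06-04 is a Monday.
anchored = pd.date_range('2018-06-04', periods=3, freq='W')      # Sundays
stepped = pd.date_range('2018-06-04', periods=3, freq='7D')      # every 7 days from the start
mondays = pd.date_range('2018-06-04', periods=3, freq='W-MON')   # weekly, anchored on Monday

print(anchored)
print(stepped)
print(mondays)
```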

Pandas is smart at inferring frequencies:

In [None]:
tmp = pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], freq='infer')
tmp

In [None]:
ts = pd.Series(range(len(tmp)), index=tmp)
ts

In [None]:
ts.resample('D').sum().index

Alternatively we could use a `Period` index to signal that a topic belongs to an entire week.

In [None]:
prd = pd.Period('2018-06-04')
prd

In [None]:
prd.freq

In [None]:
prd += pd.to_timedelta('1 day')
prd

In [None]:
index = pd.period_range('2018-06-04', periods=5, freq='W')
schedule = pd.Series(['Analyzing Data with Pandas', 'Statistical Visualization', 
                  'Statistical Modeling', 'Creating Experiments',
                  'Other tools and libraries'], index=index)
schedule

You can easily convert between `Timestamp` and `Period`.

In [None]:
schedule = schedule.to_timestamp()
schedule

In [None]:
schedule.to_period(freq='W')

In [None]:
prd

In [None]:
prd.to_timestamp().to_period(freq='2D')

### Aside: Accessing values in Series

In [None]:
idx = pd.period_range('2000', periods=4)
idx

For Series and Indexes backed by normal NumPy arrays, Series.array will return a new arrays.PandasArray, which is a thin (no-copy) wrapper around a numpy.ndarray. PandasArray isn’t especially useful on its own, but it does provide the same interface as any extension array defined in pandas or by a third-party library.

In [None]:
idx.array

In [None]:
pd.Series([1, 2, 3]).array

In [None]:
idx.to_numpy()

In [None]:
type(idx.to_numpy()[0])

## Reading timeseries data

In [None]:
import matplotlib
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # seaborn's default style; the 'seaborn' style name is gone from recent matplotlib
matplotlib.rcParams['figure.figsize'] = 12, 8

In [None]:
!head data/ao_monthly.txt

In [None]:
ts = pd.read_fwf('data/ao_monthly.txt', header=None, index_col=0)
ts.head()

This creates an integer index instead of the desired `DatetimeIndex`.

In [None]:
ts.index

In [None]:
ts = pd.read_fwf('data/ao_monthly.txt', header=None, index_col=0,
                parse_dates=[[0, 1]], infer_datetime_format=True)
ts.head()

In [None]:
ts.index

In [None]:
ts.plot()

Now that our series is indexed by timestamps, we can aggregate using time related semantics.

In [None]:
ts.index.year

In [None]:
ts.groupby(ts.index.year).mean().head()

In [None]:
ts.groupby(ts.index.year).mean().plot(marker='o')

Using `pd.Grouper` we can specify more complex groupings.

In [None]:
ts.groupby(pd.Grouper(freq='5Y')).mean().head()

In [None]:
ts.groupby(pd.Grouper(freq='5Y')).mean().index[0]

In [None]:
ts.groupby(pd.Grouper(freq='d')).mean().head()

### Resampling
If you do not like the frequency at which your data is sampled you can change the sampling frequency.

In [None]:
nineteenfifty = ts.loc['1950']  # select all rows of the year 1950
nineteenfifty.head()

In [None]:
nineteenfifty.plot(marker='o')

In [None]:
nineteenfifty.asfreq('12D', method='ffill').head()

In [None]:
nineteenfifty.plot(style='--o')

In [None]:
fig, ax = plt.subplots(nrows=2, sharex=True)

nineteenfifty.asfreq('12D').plot(ax=ax[0], style='-o')
nineteenfifty.asfreq('12D', method='ffill').plot(ax=ax[1], marker='o')
nineteenfifty.asfreq('12D', method='bfill').plot(ax=ax[1], style='--o', label='backward')
nineteenfifty.plot(ax=ax[1], style='o')
ax[0].legend(['no fill']);
ax[1].legend(['forward-fill', 'back-fill', 'original']);

Downsampling can be done by specifying a lower frequency.

In [None]:
nineteenfifty.asfreq('3M', method='ffill').plot(marker='o')

In [None]:
fig, ax = plt.subplots()
nineteenfifty.asfreq('3M', method='ffill').plot(marker='o', ax=ax)
nineteenfifty.plot(ax=ax, style='o')
ax.legend(['3 Month', 'original']);

In [None]:
#last = nineteenfifty.iloc[-1]
#last.name = last.name.to_period(freq='M').to_timestamp(how='E')

Resampling can also be combined with aggregation using `resample`.
Let's look at some stock data to illustrate this.

In [None]:
yahoo = pd.read_csv('data/yahoo_stock.csv', index_col=0, parse_dates=True)
yahoo.head()

In [None]:
ts = yahoo['Close']
ts.plot()

http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

In [None]:
ts.plot(alpha=0.5, style='-')
ts.resample('BA').mean().plot(style=':')
ts.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
           loc='upper left');
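
The difference between the two curves is easier to see on a tiny series. In this sketch, `daily` is made-up data: `asfreq` simply picks the value at each new timestamp, while `resample` aggregates everything that falls into the bucket.

```python
import pandas as pd

# Hypothetical daily series with six values.
idx = pd.date_range('2020-01-01', periods=6, freq='D')
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

picked = daily.asfreq('2D')              # value at every second day
averaged = daily.resample('2D').mean()   # mean over each two-day bucket

print(picked)
print(averaged)
```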

### Shifting and Differencing
Shifting data in time can be done in two ways. `shift(periods)` moves the data relative to the index, creating missing values at one end and losing data at the other. In contrast, `shift(periods, freq=...)` only shifts the time index and leaves the data itself untouched. Older pandas versions offered `tshift` for the latter; it was deprecated in pandas 1.1 and removed in 2.0.

In [None]:
fig, ax = plt.subplots(3, sharey=True)

# apply a frequency to the data
ts_resampled = ts.asfreq('D', method='ffill')

ts.plot(ax=ax[0])
ts_resampled.shift(365).plot(ax=ax[1])            # move the values, keep the index
ts_resampled.shift(365, freq='D').plot(ax=ax[2])  # move the index, keep the values

# legends and annotations
local_max = pd.to_datetime('2011-11-05')
offset = pd.Timedelta(365, 'D')

ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(365)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')

ax[2].legend(["shift(365, freq='D')"], loc=2)
ax[2].get_xticklabels()[2].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');
plt.tight_layout()

Shifting is useful for calculations that compare values across timesteps. An example is differencing to remove trend in the timeseries.

In [None]:
(ts_resampled - ts_resampled.shift(periods=1)).plot()

For differencing, pandas provides the convenient `diff` method.

In [None]:
ts_resampled.diff(periods=1).plot()
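
The same operations on a four-point series, where the result can be read off directly. The series `x` is made up for illustration, and index shifting is written as `shift(1, freq='D')`.

```python
import pandas as pd

# Hypothetical daily series.
idx = pd.date_range('2020-01-01', periods=4, freq='D')
x = pd.Series([10.0, 12.0, 15.0, 19.0], index=idx)

moved_values = x.shift(1)            # values slide down one row, first becomes NaN
moved_index = x.shift(1, freq='D')   # values stay, every timestamp moves one day ahead

manual_diff = x - x.shift(1)         # difference to the previous time step
builtin_diff = x.diff()              # the same, using the convenience method

print(moved_values)
print(moved_index)
print(builtin_diff)
```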

## Window functions
Window functions are similar to `groupby` as they split data into different groups based on a changing window. The points in each window are aggregated using a summary statistic and then combined back into a timeseries.

### Rolling window
A rolling window is the standard example of a window function. It moves a window of fixed size across the timeseries.

In [None]:
ts_resampled.plot()
ts_resampled.rolling(365).mean().plot()

With `center=True`, the aggregated value is placed at the middle of the window in the new series and not at its end. 

In [None]:
ts_resampled.plot()
ts_resampled.rolling(365, center=True).mean().plot()

### Expanding windows
An expanding window only has a minimal size. Then it grows bigger with each step, taking all previous values into account. This is useful if your timeseries measures a stationary value that only fluctuates around a mean.

In [None]:
ts_resampled.plot()
ts_resampled.expanding(min_periods=365).mean().plot()

### Exponential weighted windows
An exponentially weighted window works like an expanding window, but gives more recent datapoints an exponentially higher weight in all calculations. Thus it can be viewed as a smooth version of a rolling window.

In [None]:
ts_resampled.plot()
ts_resampled.ewm(com=50.5, min_periods=5).mean().plot()
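
To compare the window types on numbers that can be checked by hand, here is a sketch on a made-up five-point series. The choices `alpha=0.5` and `adjust=False` are assumptions made to keep the arithmetic simple; they differ from the `com=50.5` setting used in the plot above.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Fixed window of 3 points, result placed at the end of the window.
rolling_mean = x.rolling(3).mean()

# Same window, result placed at the middle of the window.
centered_mean = x.rolling(3, center=True).mean()

# Window grows with every step and always starts at the first point.
expanding_mean = x.expanding(min_periods=1).mean()

# Recursive form: y[t] = 0.5 * y[t-1] + 0.5 * x[t], starting with y[0] = x[0].
ewm_mean = x.ewm(alpha=0.5, adjust=False).mean()

print(pd.DataFrame({'rolling': rolling_mean, 'centered': centered_mean,
                    'expanding': expanding_mean, 'ewm': ewm_mean}))
```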

The first stop for anything to do with timeseries: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Further:  
A complete time series analysis tutorial. It includes handling time zones plus basic time series prediction and classification.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('zmfe2RaX-14')