# Date & time in pandas

Topics we will cover:
* Converting datetime columns
* Date and time as index
* Computing statistics using dates & time
* Shift
* Resampling
* .dt accessor


For this notebook we will use data downloaded from data.amsterdam.nl, a very useful website for open data about Amsterdam. The dataset describes the air quality in Amsterdam.

We have downloaded data from 01/01/2019 until 04/11/2019 for the location Amsterdam Vondelpark. It contains data about 7 air quality components: carbon monoxide (CO), nitric oxide (NO), ozone (O3), soot (FN), particulate matter (PM10 and PM25), and nitrogen dioxide (NO2). It also contains a column called 'airquality_index', an index from 1 to 11 representing the public health impact of air pollution (1 = low risk, 11 = very high risk).

Link to data:
https://data.amsterdam.nl/datasets/Jlag-G3UBN4sHA/luchtkwaliteit/

More information on the data and airquality:
https://www.luchtmeetnet.nl/uitleg

In [None]:
### Steps for use with colab
# First step to mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My\ Drive
# Clone Pyladies repo 
#! git clone --recursive https://github.com/pyladiesams/Pandas-advanced-nov2019.git
# Install requirements
! pip install pandas==0.25.3
import pandas as pd
# Move into repo
%cd /content/drive/My\ Drive/Pandas-advanced-nov2019/workshop/

## 0. Load data

In [1]:
import datetime
import matplotlib
import pandas as pd

%matplotlib inline

In [2]:
airquality = pd.read_csv("./data/airquality.csv", delimiter=";", decimal=",")

# rename columns from Dutch to English
airquality.columns = ["time", "location", "component", "value", "airquality_index"]

In [3]:
airquality.shape

(45459, 5)

In [4]:
airquality.head(2)

,time,location,component,value,airquality_index
0,2019-01-01 01:00:00+01:00,Amsterdam-Vondelpark,CO,298.1,2
1,2019-01-01 01:00:00+01:00,Amsterdam-Vondelpark,NO,5.2,1


Looking at the time column we can already foresee a problem: the values carry a timezone offset (`+01:00`).

In [5]:
airquality.dtypes

time                 object
location             object
component            object
value               float64
airquality_index      int64
dtype: object

Note that pandas inferred the time column as an 'object' column, meaning the values are still plain strings.

## 1. Converting timestamp columns

### 1.1 String to timestamp

In [128]:
# Try to convert the 'time' column to a datetime column using pd.to_datetime

## your code here ##

You could get the following error:

`ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True`

You can read more about timezones & pandas here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-zone-handling

Let's try to convert the time column again, now using utc=True. Let's also do it in a nicer way by specifying a format: besides the datetime strings themselves, you can pass a format argument so that every value is parsed in exactly the same way. This can also speed up the conversion considerably.

You can find an overview of formats here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [1]:
time_col = ["time"]

def convert_to_datetime(col, fmt = # specify the format here #):
    return ## Your code here ##

airquality[time_col] = airquality[time_col].apply(## Your code here ##)

print(airquality.dtypes)

airquality.head(2)
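
One possible way to fill in the exercise above, shown as a minimal sketch on a small made-up sample rather than on the real file. The `sample` dataframe and the format string `"%Y-%m-%d %H:%M:%S%z"` are assumptions based on the two rows printed earlier; adjust the format if your data looks different.

```python
import pandas as pd

# Hypothetical sample that mimics the 'time' column of the air quality data
sample = pd.DataFrame(
    {"time": ["2019-01-01 01:00:00+01:00", "2019-01-01 02:00:00+01:00"]}
)

def convert_to_datetime(col, fmt="%Y-%m-%d %H:%M:%S%z"):
    # utc=True converts the timezone-aware values to UTC
    return pd.to_datetime(col, format=fmt, utc=True)

sample["time"] = convert_to_datetime(sample["time"])

print(sample.dtypes)
print(sample.head(2))
```

Note that 01:00 at `+01:00` becomes 00:00 in UTC, so the clock times shift by one hour.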

Great! Now we have our time column in UTC time and the type of the column is datetime64. In the pandas version used here, datetime columns are always of type `datetime64[ns]` or `datetime64[ns, tz]`; pandas 2.0 and later also support coarser resolutions.

In [9]:
# You can easily convert to other timezones in this way:
airquality.time.dt.tz_convert('US/Pacific')

In [8]:
# Try to remove the timezone information completely.
# Save the dataframe and move on to the next section!

## Your code here ##

print(airquality.head(2))
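
A minimal sketch of one way to drop the timezone, assuming the column is already in UTC. The `times` series is a made-up stand-in for `airquality.time`.

```python
import pandas as pd

# Hypothetical timezone-aware sample in UTC
times = pd.Series(
    pd.to_datetime(["2019-01-01 00:00:00", "2019-01-01 01:00:00"], utc=True)
)

# tz_localize(None) removes the timezone but keeps the clock time (here: UTC)
times_naive = times.dt.tz_localize(None)

print(times.dt.tz)
print(times_naive.dt.tz)
```

After this step the values are "naive" timestamps, so it is up to you to remember that they represent UTC.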

### 1.2 Datetime to string

There might be reasons to convert a timestamp back to a string. In that case you can use **datetime.strftime**, with the same format codes as before.

In [32]:
# make a copy of the dataframe so we do not alter our original one.
airquality_temp = airquality.copy()

airquality_temp["time"] = airquality_temp["time"].apply(# your code here #)

# check results
print(airquality_temp.head(2))
print(airquality_temp.dtypes)
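
A minimal sketch of the conversion back to strings on a made-up series. The format `"%d/%m/%Y %H:%M"` is only an example; the element-wise `apply` mirrors the exercise, and `.dt.strftime` is the vectorised equivalent.

```python
import pandas as pd

# Hypothetical timestamps standing in for the 'time' column
times = pd.Series(pd.to_datetime(["2019-01-01 00:00:00", "2019-11-04 13:30:00"]))

# Element-wise, as in the exercise: call strftime on every Timestamp
as_text_apply = times.apply(lambda ts: ts.strftime("%d/%m/%Y %H:%M"))

# Vectorised alternative through the .dt accessor
as_text_dt = times.dt.strftime("%d/%m/%Y %H:%M")

print(as_text_apply.tolist())
print(as_text_dt.tolist())
```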

### 1.3 Handling errors

Several errors can happen when you try to convert a column to datetime.
Pandas lists three ways to deal with them:
- `errors='raise'` raise when unparseable
- `errors='ignore'` return original input when unparseable
- `errors='coerce'` convert unparseable data to `NaT` (not a time)

Let's add some errors to our data by changing the timestamp of the first row to a string.

In [None]:
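airquality_errors = airquality.copy()
# cast to string first so the column can hold a value that is not a timestamp
airquality_errors["time"] = airquality_errors["time"].astype(str)
# use .loc instead of chained indexing so the assignment changes the dataframe itself
airquality_errors.loc[0, "time"] = "not a time"
print(airquality_errors.iloc[0])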

In [7]:
# now try to convert the time column, and try some different options for the errors.

## Your code here ##
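
A minimal sketch of two of the error options on a made-up series with one bad entry. `errors='ignore'` is left out because recent pandas releases deprecate it.

```python
import pandas as pd

# Hypothetical column with one unparseable entry
raw = pd.Series(["2019-01-01 00:00:00", "not a time"])

# errors="coerce": unparseable entries become NaT
coerced = pd.to_datetime(raw, errors="coerce")
print(coerced)

# errors="raise" (the default): an unparseable entry raises an exception
try:
    pd.to_datetime(raw, errors="raise")
    raised = False
except ValueError as exc:
    raised = True
    print("conversion failed:", exc)
```

With `coerce` it is worth counting the `NaT` values afterwards (for example with `coerced.isna().sum()`) to see how much data was lost.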

In [8]:
# now try to convert a column with numbers, say 'value', and see what happens

## Your code here ##

That's pretty odd! No matter which error option we specify, no error is raised: pandas interprets the numbers as offsets from the unix epoch (in nanoseconds by default) and converts them to timestamps!
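
To make this behaviour concrete, here is a minimal sketch with made-up integers. The `unit` argument of `pd.to_datetime` controls how the numbers are interpreted.

```python
import pandas as pd

# Hypothetical numeric column
numbers = pd.Series([0, 1_000_000_000])

# Default: the numbers are read as nanoseconds since 1970-01-01
as_ns = pd.to_datetime(numbers)

# The same numbers read as seconds since 1970-01-01
as_s = pd.to_datetime(numbers, unit="s")

print(as_ns)
print(as_s)
```

The same input gives timestamps one second apart in the first case and more than thirty years apart in the second.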


The main take-away is that converting columns to datetime can be tricky, and you should always be careful when doing so. It's important to check your columns before and after, to make sure the conversion succeeded.

### 1.4 Timestamp to epoch

As we saw before, pandas can interpret numbers as unix timestamps. The unix timestamp tracks time as a running total of seconds since the epoch, which is defined as January 1st, 1970 at 00:00:00 UTC. Pandas stores timestamps with nanosecond resolution, so casting a timestamp to an integer gives nanoseconds rather than seconds.

In [None]:
# convert our column 'time' from timestamp to unix time
# 1) make a new column 'time_unix', 
# 2) cast the timestamp to integer using .astype("int64") (plain int can be 32-bit on some platforms), 
# 3) convert the column from nanoseconds to seconds

# your code here #

print(airquality[["time", "time_unix"]].sample(3))
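
A minimal sketch of the three steps on a made-up series. The explicit cast to `datetime64[ns]` is an assumption added so that the integer values are nanoseconds regardless of the pandas version; the subtraction-based variant at the end gives the same result without relying on the internal resolution.

```python
import pandas as pd

# Hypothetical timezone-naive timestamps (interpreted as UTC)
times = pd.Series(pd.to_datetime(["1970-01-01 00:00:01", "2019-01-01 00:00:00"]))

# Cast to nanosecond resolution, then to a 64-bit integer (nanoseconds since epoch)
nanoseconds = times.astype("datetime64[ns]").astype("int64")

# Convert nanoseconds to whole seconds
time_unix = nanoseconds // 10**9

# Alternative: subtract the epoch and divide by a one-second Timedelta
time_unix_alt = (times - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(time_unix.tolist())
```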

### 1.5 Creating timestamps

You can create your own Pandas timestamps in several ways

In [10]:
# Use `datetime.datetime` and `pd.Timestamp` to create a Pandas timestamp
pd.Timestamp(datetime.datetime(2019,11,28))

In [11]:
# Use plain strings
pd.Timestamp("2018-01-05")

In [12]:
# Or just use `pd.Timestamp` as is
pd.Timestamp(2012, 5, 1)

In [13]:
# Generate sequences of fixed-frequency dates and time spans
pd.date_range('2018-01-01', periods=3, freq='H')

In [14]:
pd.period_range('1/1/2011', freq='M', periods=3)

Try to create some example dataframes with different indexes, which you create by using the options above!

In [15]:
## Your code here ##
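
A few example dataframes as a minimal sketch. All values and names (`daily`, `monthly`, `custom`) are made up for illustration.

```python
import pandas as pd

# Dataframe indexed by daily timestamps
daily = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2018-01-01", periods=3, freq="D"),
)

# Dataframe indexed by monthly periods
monthly = pd.DataFrame(
    {"value": [10.0, 20.0, 30.0]},
    index=pd.period_range("2011-01", periods=3, freq="M"),
)

# Dataframe indexed by hand-picked timestamps
custom = pd.DataFrame(
    {"value": [5.0, 6.0]},
    index=pd.DatetimeIndex([pd.Timestamp("2018-01-05"), pd.Timestamp(2012, 5, 1)]),
)

print(daily.index)
print(monthly.index)
print(custom.index)
```

A `DatetimeIndex` marks points in time, while a `PeriodIndex` marks whole spans such as a month.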

## 2. Use datetime as index

Now that we have successfully converted our column to datetime we can do some useful stuff with it! One of these is to set our datetime column as the index of our dataframe. Pros according to Pandas:
* A large range of dates for various offsets are pre-computed and cached under the hood in order to make generating subsequent date ranges very fast (just have to grab a slice).
* Fast shifting using the `shift` and `tshift` method on pandas objects.
* Unioning of overlapping `DatetimeIndex` objects with the same frequency is very fast (important for fast data alignment).
* Quick access to date fields via properties such as `year`, `month`, etc.
* Regularization functions like `snap` and very fast `asof` logic.


In [27]:
# Make a copy of our original dataframe
# Filter the dataframe on component == "CO"
# Set the datetime column as an index of your dataframe
airquality_indexed = airquality.copy()
airquality_indexed = airquality_indexed.loc[airquality_indexed.component == "CO"]
airquality_indexed = airquality_indexed.set_index("time")

In [20]:
print(airquality_indexed.sample(2))

In [21]:
# you can use one date to index
airquality_indexed.loc[datetime.datetime(2019,1,1),:]

In [22]:
# you can also use strings
airquality_indexed.loc["2019-01-01"]

In [23]:
# you can also slice the dataframe
airquality_indexed.loc[:datetime.datetime(2019,1,7)]

In [24]:
airquality_indexed.loc["2019-01-03":"2019-01-07"]

In [25]:
# you can also pass in the year or month
airquality_indexed.loc["2019-1"]

In [26]:
# you can also use the method 'truncate'
airquality_indexed.truncate(before='2019-02-01', after='2019-03-01')

Try to create some plots of the amount of 'CO', for:
* Month of January
* 1st of July
* After 1st of February but before 1st of September

In [None]:
## Your code here ##
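
A minimal sketch of how such selections could look, using a made-up hourly dataframe `co` in place of `airquality_indexed`. The date bounds of the last selection are only an example.

```python
import pandas as pd

# Hypothetical hourly data covering the same period as the workshop data
idx = pd.date_range("2019-01-01", "2019-11-04", freq="60min")
co = pd.DataFrame({"value": range(len(idx))}, index=idx)

# A whole month via partial string indexing
january = co.loc["2019-01"]

# A single day
first_of_july = co.loc["2019-07-01"]

# A range between two dates; string slices include both end points
between_dates = co.loc["2019-02-01":"2019-09-01"]

# Each selection can then be plotted, e.g. january["value"].plot()
print(len(january), len(first_of_july), len(between_dates))
```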

## 3. Computing statistics

In [None]:
# you can do simple operations like addition and subtraction
print(
    airquality_indexed.loc["2019-01-01 00:00:00"].value - \
    airquality_indexed.loc["2019-01-01 01:00:00"].value
)

In [None]:
# calculate the difference between every hour
airquality_indexed.loc["2019-01-01"].value.diff()

In [None]:
# summing
airquality_indexed.loc["2019-01-01"].value.sum()

In [16]:
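# 'ozone' is not defined earlier; we assume it holds the O3 measurements indexed by time
ozone = airquality.loc[airquality.component == "O3"].set_index("time")
# take the rolling average of 2 days
ozone.rolling("2D").mean().head(2)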

In [17]:
# expanding is useful for calculating the cumulative sum
ozone.expanding().sum()

time,location,component,value,airquality_index
2019-01-01 00:00:00,Amsterdam-Vondelpark,O3,30.7,3
2019-01-01 01:00:00,Amsterdam-Vondelpark,O3,48.2,4


Plot the difference between the hours of January 1st

In [None]:
## Your code here

Try to plot the rolling mean per week for the month of March

In [None]:
## Your code here
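
A minimal sketch of `diff`, a time-based rolling mean and an expanding sum on a made-up hourly series for March. The values simply count the hours so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical hourly series for March 2019: 0.0, 1.0, 2.0, ...
idx = pd.date_range("2019-03-01", periods=31 * 24, freq="60min")
march = pd.Series(range(len(idx)), index=idx, dtype="float64")

# Difference between consecutive hours on the first day
hourly_change = march.loc["2019-03-01"].diff()

# Rolling mean over a 7-day time window
weekly_mean = march.rolling("7D").mean()

# Cumulative sum via expanding()
running_total = march.expanding().sum()

# weekly_mean.plot() would draw the smoothed line
print(hourly_change.head(3))
print(weekly_mean.tail(1))
```

A time-based window such as `"7D"` needs a sorted datetime index, which is one more reason to set the time column as index first.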

## 4. Shift

With `pandas.DataFrame.shift` you can shift the index by a desired number of periods.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

Try adding a column to the dataframe with the value of 'CO' of the previous hour.
Then plot the two lines!

In [None]:
## Your code here
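
A minimal sketch of the previous-hour column on a made-up hourly dataframe. The column name `value_prev_hour` is hypothetical.

```python
import pandas as pd

# Hypothetical hourly CO values
idx = pd.date_range("2019-01-01", periods=4, freq="60min")
co = pd.DataFrame({"value": [10.0, 12.0, 11.0, 15.0]}, index=idx)

# shift(1) moves the values one row down, so each row sees the previous hour
co["value_prev_hour"] = co["value"].shift(1)

# With freq, the index is moved instead of the values
shifted_index = co["value"].shift(1, freq="60min")

# co[["value", "value_prev_hour"]].plot() would draw both lines
print(co)
```

Shifting by rows only equals "the previous hour" when the data has no gaps; shifting with `freq` avoids that assumption.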

## 5. Resampling

A convenient method for converting the frequency of your dataframe is `resample`.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html

In [29]:
# Try to resample the dataframe to a daily frequency by taking the mean.
airquality_indexed.resample("1D").mean()

In [None]:
# plot the daily values of 'CO' for January
## Your code here

Besides resampling to a lower frequency (downsampling), you can also resample to a higher frequency (upsampling).

In [30]:
# Resample the data to a half-hour frequency, using forward filling
airquality_indexed.resample("0.5H").ffill()

In [None]:
# you can also interpolate the missing datapoints by using .interpolate
airquality_indexed.resample("0.5H").interpolate(method="linear")

Resample the data of January to a half-hour frequency and plot the result using method="linear" and also try other methods such as "spline"

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html

In [31]:
## Your code here
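
A minimal sketch of downsampling and upsampling on a made-up hourly dataframe. It uses the alias `"30min"` for the half-hour frequency and only shows `method="linear"`; other methods such as `"spline"` need SciPy and extra arguments.

```python
import pandas as pd

# Hypothetical hourly values over two days: 0.0, 1.0, ..., 47.0
idx = pd.date_range("2019-01-01", periods=48, freq="60min")
hourly = pd.DataFrame({"value": [float(i) for i in range(48)]}, index=idx)

# Downsample to daily means
daily_mean = hourly.resample("1D").mean()

# Upsample to half hours, repeating the last known value
half_hourly_ffill = hourly.resample("30min").ffill()

# Upsample to half hours, interpolating linearly between known values
half_hourly_linear = hourly.resample("30min").interpolate(method="linear")

print(daily_mean)
print(half_hourly_linear.head(3))
```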

## 6. .dt accessor

You can use the `.dt` accessor to extract information from the date

Overview of all .dt accessors: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components

In [32]:
# Add three columns to the airquality dataframe for day, day of the week, and hour.

## Your code here

In [34]:
print(airquality.sample(3))

It allows you to do easy filtering as well, for example, select only datapoints where hour == 2

In [37]:
## Your code here
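
A minimal sketch of the `.dt` accessor on a made-up dataframe. The column names `day`, `day_of_week` and `hour` are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with an hourly 'time' column over two days
df = pd.DataFrame({"time": pd.date_range("2019-01-01", periods=48, freq="60min")})

df["day"] = df["time"].dt.day
df["day_of_week"] = df["time"].dt.dayofweek  # Monday = 0, Sunday = 6
df["hour"] = df["time"].dt.hour

# Keep only the rows measured at 02:00
at_two = df.loc[df["time"].dt.hour == 2]

print(at_two)
```

Note that `.dt` works on a datetime column; when the timestamps are the index, the same properties are available directly, for example `df.index.hour`.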

# The end

That's the end of this notebook!
Feel free to play around some more with the data. Some ideas of what you can do:
- Try to fit a model on the data of January and February and predict the level of 'CO' for the month of March
- Try adding features to the model such as the day of the week, and the hour of the day and see if your model becomes better
- Make some very creative plots of the different components over different time periods!