# Intro

Pandas has first-class support for datetime types, including flexible indexing, vectorized operations, `groupby` operations and joins. This makes EDA on time series data with Pandas very convenient and productive.

# NOTEBOOK WON'T WORK, CSV FILE IS TOO BIG TO UPLOAD TO GIT

In [1]:
%pylab inline
plt.style.use('bmh')

import pathlib
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

Populating the interactive namespace from numpy and matplotlib


In [2]:
DATA_DIR = pathlib.Path("./")

# Loading data

The dataset we'll use to explore time series functionality in Pandas is [1.6 million UK traffic accidents](https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales). The file loaded here covers the years 2005-2007, but note that in the exam we only use 2005.

In [4]:
d = pd.read_csv(DATA_DIR.joinpath('accidents_2005_to_2007.csv.zip'))


The dataset is quite large. Let's explore its per-column breakdown:

In [5]:
d.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570011 entries, 0 to 570010
Data columns (total 33 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   Accident_Index                               570011 non-null  object 
 1   Location_Easting_OSGR                        569910 non-null  float64
 2   Location_Northing_OSGR                       569910 non-null  float64
 3   Longitude                                    569910 non-null  float64
 4   Latitude                                     569910 non-null  float64
 5   Police_Force                                 570011 non-null  int64  
 6   Accident_Severity                            570011 non-null  int64  
 7   Number_of_Vehicles                           570011 non-null  int64  
 8   Number_of_Casualties                         570011 non-null  int64  
 9   Date                                         570011 non-nul

As we can see, the date and time are stored in separate columns, and we need to combine them into a full datetime:

In [None]:
d.Date.head()

In [None]:
d.Time.head()

Let's explore if we have any missing dates or times:

In [None]:
d[d.Date.isnull()]

In [None]:
d[d.Time.isnull()]

We may note the following:
    
- date and time are provided as strings,
- we have slashes in dates, and this can be parsed ambiguously,
- some times are missing.

Hence, our strategy is the following:

- concatenate date and time using string vectorized operations,
- set placeholder for missing times to be `00:00`,
- parse resulting (**string**) datetime with explicit `dayfirst=True`.
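The strategy above can be tried on a toy frame before running it on the full dataset. This is a minimal sketch: the `toy` frame and its values are made up for illustration, and only the `Date` and `Time` column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: day-first dates with slashes, one missing time.
toy = pd.DataFrame({
    "Date": ["13/01/2005", "01/02/2005"],
    "Time": ["17:42", np.nan],
})

# Concatenate the two string columns; na_rep fills in the missing time.
combined = toy.Date.str.cat(toy.Time, sep=" ", na_rep="00:00")

# dayfirst=True makes "01/02/2005" the 1st of February, not January 2nd.
parsed = pd.to_datetime(combined, dayfirst=True)
print(parsed)
```

Without `dayfirst=True`, a string such as `01/02/2005` would typically be read month-first, which is the ambiguity noted above.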

In [None]:
d.loc[:, 'dt'] = d.Date.str.cat(d.Time, sep=' ', na_rep='00:00')

In [None]:
d.dt

In [None]:
d.loc[:, 'date_time'] = pd.to_datetime(d.dt, dayfirst=True)

We now have a `date_time` column of type `datetime64[ns]`:

In [None]:
d.info()  # Note the difference without `memory_usage="deep"`

Let's drop the columns we do not need:

In [None]:
d.columns

In [None]:
COLS = ['Accident_Index', 'Longitude', 'Latitude',
        'Accident_Severity', 'Number_of_Vehicles',
        'Number_of_Casualties', 'Weather_Conditions',
        'Day_of_Week', 'Road_Surface_Conditions',
        'Special_Conditions_at_Site', 'Urban_or_Rural_Area',
        'Carriageway_Hazards', 'date_time']

In [None]:
d.drop([c for c in d.columns if c not in COLS], axis=1, inplace=True)

Pandas has a dedicated set of index types for datetime indexes:

In [None]:
d.set_index('date_time', inplace=True)

In [None]:
d.index

In [None]:
d.head()

# `DatetimeIndex` in detail

`DatetimeIndex` is special in many ways. It allows for much more flexible indexing compared to usual indexes. First of all, you can use strings, not just actual index labels. To leverage this, we first sort the index:

In [None]:
d.sort_index(inplace=True)

We can now use strings to index the dataframe (indexing non-monotonic `DatetimeIndex` with strings is not a very good idea):

In [None]:
d["2006-02-12 20":"2006-03"]

In [None]:
d["2006":]

Note how Pandas allows partial datetime strings. Of course, this way of indexing can be combined with column selection:

In [None]:
d.loc["2005", "Accident_Severity"]
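To see exactly which rows a partial string selects, here is a small sketch on a made-up frame (the `toy` name, the timestamps and the `severity` values are all hypothetical). A partial string used as a slice bound expands to the start or the end of the period it names, and both bounds are inclusive.

```python
import pandas as pd

# Hypothetical toy data with a sorted DatetimeIndex.
idx = pd.to_datetime([
    "2005-12-31 23:00",
    "2006-02-12 19:00",
    "2006-02-12 21:30",
    "2006-03-31 08:00",
    "2006-04-01 00:00",
])
toy = pd.DataFrame({"severity": [3, 2, 3, 1, 2]}, index=idx).sort_index()

# From 20:00 on 12 Feb 2006 up to and including the end of March 2006.
window = toy.loc["2006-02-12 20":"2006-03"]

# All rows of 2006, restricted to a single column.
year_2006 = toy.loc["2006", "severity"]

print(window)
print(year_2006)
```

The 19:00 row is excluded because it falls before the lower bound, while the 31 March row is kept because `"2006-03"` as an upper bound covers the whole month.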

We will now create a dataframe used in Problem 6 of the exam:

In [None]:
d.loc["2005", "Accident_Severity"].to_csv(DATA_DIR.joinpath("accidents_2005.csv"))

In [None]:
accidents_2005 = pd.read_csv(DATA_DIR.joinpath("accidents_2005.csv"),
                             parse_dates=["date_time"])

In [None]:
accidents_2005.head()

Note that it is not indexed by `date_time`, and that's exactly the way it's passed to the solution function.

# Resampling time series

Time series in Pandas can be easily resampled to any frequency:

In [None]:
d.resample('D')

Similar to `groupby`, `resample` doesn't perform any operations on its own, but just calculates which rows go to which (datetime) bin. We need to further apply some aggregation operation. For example, we may calculate the number of accidents per day:

In [None]:
daily = d.resample('D').size()
daily
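One difference from a plain `groupby` is worth seeing on a small example: `resample` builds a regular grid of bins, so days without any rows still appear (with a count of zero), whereas grouping by the floored timestamp only returns days that actually occur. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two rows on 1 Jan, none on 2 Jan, one on 3 Jan.
idx = pd.to_datetime(["2005-01-01 08:00", "2005-01-01 17:30", "2005-01-03 12:00"])
toy = pd.DataFrame({"casualties": [1, 2, 1]}, index=idx)

# resample creates one bin per calendar day, including the empty one.
binned = toy.resample("D").size()

# groupby on the floored index only yields the days present in the data.
grouped = toy.groupby(toy.index.floor("D")).size()

print(binned)
print(grouped)
```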

`daily` has `DatetimeIndex` as well and has `freq` specified (as it was constructed to have one):

In [None]:
daily.index

In [None]:
daily.index.is_monotonic_increasing, daily.index.is_unique

Pandas also exposes plotting functionality for datetime-indexed dataframes. To illustrate this, let's plot the daily and the weekly average number of accidents:

In [None]:
# Just a hint: you can set image resolution in dpi
plt.figure(figsize=(8,3), dpi=150)  

daily.plot(ax=plt.gca(), linewidth=0.5)

(d.resample('W').size()/7.).plot(ax=plt.gca(),
                                 linewidth=1,
                                 color='firebrick')

plt.ylabel('average daily accidents')
plt.xlabel('week');

In EDA terms, we just gained our first insight: accidents are strongly seasonal (with non-trivial seasonal structure and high dependence on holidays).
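One caveat about the weekly curve: the `'W'` frequency produces weeks ending on Sunday, so the first and last bins may cover fewer than seven days, and dividing their totals by 7 understates the daily average there. The sketch below uses a made-up constant series of 10 events per day to show the effect, and compares it with averaging the daily counts directly.

```python
import pandas as pd

# Hypothetical daily counts: 10 per day, starting on Saturday 1 Jan 2005.
idx = pd.date_range("2005-01-01", "2005-01-16", freq="D")
daily_counts = pd.Series(10, index=idx)

# Weekly totals: the first bin only holds Saturday and Sunday.
weekly_total = daily_counts.resample("W").sum()

# Dividing by 7 understates the partial first week...
weekly_div7 = weekly_total / 7.0

# ...while averaging the daily counts per bin does not.
weekly_mean = daily_counts.resample("W").mean()

print(weekly_total)
print(weekly_div7)
print(weekly_mean)
```

For the interior weeks the two approaches agree, so this only matters at the edges of the plotted range.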

Similarly, we can plot daily and weekly average number of vehicles involved:

In [None]:
plt.figure(figsize=(12,5), dpi=150)

d.resample('D').Number_of_Vehicles.mean().plot(ax=plt.gca())
d.resample('W').Number_of_Vehicles.mean().plot(ax=plt.gca(), color='firebrick')

plt.ylabel('vehicles involved')
plt.xlabel('week');

Instead of `resample`, we can use `pd.Grouper`. It's not really useful as a replacement for `resample`, but it is very handy in compound grouping keys.

In [None]:
d.groupby(pd.Grouper(freq='D'))["Number_of_Casualties"].mean()
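As a quick sanity check that the two spellings agree, the sketch below computes the same daily mean both ways on a made-up frame (the `toy` name and its values are hypothetical; only the column name mirrors the dataset).

```python
import pandas as pd

# Hypothetical toy data covering two consecutive days.
idx = pd.to_datetime(["2005-01-01 08:00", "2005-01-01 20:00", "2005-01-02 09:00"])
toy = pd.DataFrame({"Number_of_Casualties": [1, 3, 2]}, index=idx)

# Daily mean via resample...
via_resample = toy.resample("D")["Number_of_Casualties"].mean()

# ...and via groupby with a frequency-based Grouper.
via_grouper = toy.groupby(pd.Grouper(freq="D"))["Number_of_Casualties"].mean()

print(via_resample)
print(via_grouper)
```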

Now let's extract the accidents that have more casualties than the average number of casualties on that day. This is where Pandas datetime magic comes into play:

In [None]:
daily_casualties = (d
                    .groupby(pd.Grouper(freq='D'))["Number_of_Casualties"]
                    .mean())

df = d.merge(daily_casualties,
             left_on=d.index.floor("1D"),
             right_index=True,
             suffixes=("", "_daily"))

Note that Pandas keeps the calculated key it used for merging as `key_0`:

In [None]:
df.head()

We do not need it at the moment, so we'll drop it:

In [None]:
df.drop("key_0", axis=1, inplace=True)

We can now calculate how extreme each accident is compared to daily averages:

In [None]:
df["delta"] = df["Number_of_Casualties"] - df["Number_of_Casualties_daily"]
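The same per-row deviation can also be obtained without a merge. The sketch below is an alternative technique, not the approach used in this notebook: it groups by the floored timestamp and uses `transform("mean")`, which returns a value for every original row. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two accidents on 1 Jan, one on 2 Jan.
idx = pd.to_datetime(["2005-01-01 08:00", "2005-01-01 17:30", "2005-01-02 12:00"])
toy = pd.DataFrame({"Number_of_Casualties": [1, 3, 2]}, index=idx)

# transform broadcasts each day's mean back onto the rows of that day.
daily_mean = (toy
              .groupby(toy.index.floor("D"))["Number_of_Casualties"]
              .transform("mean"))

toy["delta"] = toy["Number_of_Casualties"] - daily_mean
print(toy)
```

The merge-based version keeps the daily average as an explicit column, which is convenient if it is needed later; `transform` is shorter when only the difference matters.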

In [None]:
df.loc["2005"].sort_values(by="delta", ascending=False)

Let's explore the most extreme one:

In [None]:
d[d.Accident_Index=="200597EC70504"]

You may want to further investigate this case (with Google of course).

Let's get back to `pd.Grouper` and compound keys. We can flexibly combine a grouper on the datetime index with an ordinary column. Let's calculate how many accidents we have per area type each month:

In [None]:
d.groupby([pd.Grouper(freq='1M'), 'Urban_or_Rural_Area']).size()

Now we can plot this as a stacked bar plot:

In [None]:
plt.figure(figsize=(12,6))

(d.groupby([pd.Grouper(freq='1M'), 'Urban_or_Rural_Area'])
 .size()
 .unstack()
 .plot(alpha=0.6, linewidth=2, ax=plt.gca(), kind='bar', stacked=True));

We can do the same with accident severity:

In [None]:
plt.figure(figsize=(12,6))

(d.groupby([pd.Grouper(freq='1M'), 'Accident_Severity'])
 .size()
 .unstack()
 .plot(alpha=0.6, linewidth=2, ax=plt.gca(), kind='bar', stacked=True));
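To inspect what the `groupby` / `size` / `unstack` pipeline hands to the plotting call, it can be run on a small made-up frame. The sketch below uses a daily frequency instead of the monthly one above purely to keep the toy data short; the `toy` frame and the area codes are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: area code 1 or 2 for each accident.
idx = pd.to_datetime([
    "2005-01-01 08:00",
    "2005-01-01 09:00",
    "2005-01-01 22:00",
    "2005-01-02 10:00",
])
toy = pd.DataFrame({"Urban_or_Rural_Area": [1, 1, 2, 1]}, index=idx)

# Compound key: a time bin on the index plus an ordinary column.
counts = toy.groupby([pd.Grouper(freq="D"), "Urban_or_Rural_Area"]).size()

# unstack turns the inner key into columns; fill_value avoids NaN
# for combinations that never occur.
wide = counts.unstack(fill_value=0)
print(wide)
```

Each row of `wide` is one time bin and each column one category, which is the shape a stacked bar plot expects.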