# Dealing with Dates


In the last two notebooks, we learned a variety of methods to treat character and numeric data, but many data sets also contain dates that don't fit neatly into either category. Common date formats contain numbers and sometimes text as well to specify months and days. Getting dates into a friendly format and extracting features such as month and year into new variables can be useful preprocessing steps.

In [477]:
import numpy as np
import pandas as pd

In [478]:
dates = pd.read_csv('dates.csv', index_col=0)  # first CSV column holds the row labels

In [479]:
dates # Check the dates

|   | month_day_year | day_month_year | date_time | year_month_day |
|---|---|---|---|---|
| 1 | 4/22/1996 | 22-Apr-96 | Tue Aug 11 09:50:35 1996 | 2007-06-22 |
| 2 | 4/23/1996 | 23-Apr-96 | Tue May 12 19:50:35 2016 | 2017-01-09 |
| 3 | 5/14/1996 | 14-May-96 | Mon Oct 14 09:50:35 2017 | 1998-04-12 |
| 4 | 5/15/1996 | 15-May-96 | Tue Jan 11 09:50:35 2018 | 2027-07-22 |
| 5 | 5/16/2001 | 16-May-01 | Fri Mar 11 07:30:36 2019 | 1945-11-15 |
| 6 | 5/17/2002 | 17-May-02 | Tue Aug 11 09:50:35 2020 | 1942-06-22 |
| 7 | 5/18/2003 | 18-May-03 | Wed Dec 21 09:50:35 2021 | 1887-06-13 |
| 8 | 5/19/2004 | 19-May-04 | Tue Jan 11 09:50:35 2022 | 1912-01-25 |
| 9 | 5/20/2005 | 20-May-05 | Sun Jul 10 19:40:25 2023 | 2007-06-22 |


When you load data with Pandas, dates are typically loaded as strings by default. Let's check the type of data in each column:

In [480]:
for col in dates:
    print(type(dates[col][1]))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


The output confirms that all the date data is currently in string form. To work with dates, we need to convert them from strings into a data format built for processing dates. The pandas library comes with a Timestamp data object for storing and working with dates. 

You can instruct pandas to convert date columns into `Timestamp`s as it reads your data by passing the `parse_dates` argument to the data reading function with a list of column indices (or names) indicating the columns you wish to convert.
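As a minimal sketch of `parse_dates`, the snippet below reads a tiny in-memory CSV instead of a real file. The `csv_text` contents and the `id` column are made up for illustration; only `year_month_day` mirrors a column from this lesson.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a real file on disk
csv_text = "id,year_month_day\n1,2007-06-22\n2,2017-01-09\n"

# parse_dates=[1] asks pandas to convert the column at position 1 while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=[1])

print(parsed.dtypes)
```

The `year_month_day` column should come back with a datetime dtype rather than as plain strings.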

However, automatic parsing while reading does not always handle every date format correctly, so here we will convert the columns to `Timestamp` explicitly using the function `pd.to_datetime()`.

In [481]:
dates['month_day_year'] = pd.to_datetime(dates['month_day_year'])
dates['day_month_year'] = pd.to_datetime(dates['day_month_year'])
dates['date_time'] = pd.to_datetime(dates['date_time'])
dates['year_month_day'] = pd.to_datetime(dates['year_month_day'])

In [482]:
for col in dates:
    print(type(dates[col][1]))

<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
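When several columns need converting, a loop together with a dtype check can be more compact than one line per column. The sketch below assumes a small made-up `DataFrame` named `sample` with two string columns; it is an illustration, not the lesson's `dates.csv`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of string dates
sample = pd.DataFrame({
    "month_day_year": ["4/22/1996", "5/14/1996"],
    "year_month_day": ["2007-06-22", "1998-04-12"],
})

# Convert every column in place, one column at a time
for col in sample.columns:
    sample[col] = pd.to_datetime(sample[col])

# dtypes reports the type of each whole column at once
print(sample.dtypes)
```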


If you have oddly formatted date time strings, you might have to specify the exact format to get them to convert correctly into a `Timestamp`. For instance, consider a date format that gives date times of the form `hour:minute:second year-day-month`:

In [483]:
odd_date = "12:30:15 2015-29-11"

The default `to_datetime` parser will fail to convert this date because it expects dates in the form `year-month-day`. In cases like this, specify the date's format to convert it to `Timestamp`:

In [484]:
pd.to_datetime(odd_date,
               format="%H:%M:%S %Y-%d-%m")

Timestamp('2015-11-29 12:30:15')

As seen above, date formatting uses special formatting codes for each part of the date. For instance, `%H` represents hours and `%Y` represents the four-digit year. View a list of formatting codes [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).
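The same formatting codes work in both directions: for parsing with `to_datetime` and for writing a `Timestamp` back out with `strftime`. The sketch below uses date strings in the styles of the first two columns of this lesson's data. Note that `%b` (abbreviated month name) depends on the locale, so the second call assumes an English locale.

```python
import pandas as pd

# %m/%d/%Y: month, day and four-digit year separated by slashes
us_style = pd.to_datetime("4/22/1996", format="%m/%d/%Y")

# %d-%b-%y: day, abbreviated month name, two-digit year (English locale assumed)
abbrev_style = pd.to_datetime("22-Apr-96", format="%d-%b-%y")

# strftime uses the same codes to turn a Timestamp back into text
iso_text = us_style.strftime("%Y-%m-%d")

print(us_style, abbrev_style, iso_text)
```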

Once you have your dates in the `Timestamp` format, you can extract a variety of properties like the year, month and day. Converting dates into several simpler features can make the data easier to analyze and use in predictive models. Access date properties from a `Series` of `Timestamp`s with the syntax: `Series.dt.property`. To illustrate, let's extract some features from the first column of our date data and put them in a new `DataFrame`:

In [485]:
column_1 = dates["month_day_year"]  # first date column

# ISO week number (pandas >= 1.1; older releases exposed it as .dt.week)
iso_week = column_1.dt.isocalendar().week

pd.DataFrame({"year": column_1.dt.year,
              "month": column_1.dt.month,
              "day": column_1.dt.day,
              "hour": column_1.dt.hour,
              "dayofyear": column_1.dt.dayofyear,
              "week": iso_week,
              "weekofyear": iso_week,
              "dayofweek": column_1.dt.dayofweek,
              "weekday": column_1.dt.weekday,
              "quarter": column_1.dt.quarter,
             })

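|   | day | dayofweek | dayofyear | hour | month | quarter | week | weekday | weekofyear | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 22 | 0 | 113 | 0 | 4 | 2 | 17 | 0 | 17 | 1996 |
| 2 | 23 | 1 | 114 | 0 | 4 | 2 | 17 | 1 | 17 | 1996 |
| 3 | 14 | 1 | 135 | 0 | 5 | 2 | 20 | 1 | 20 | 1996 |
| 4 | 15 | 2 | 136 | 0 | 5 | 2 | 20 | 2 | 20 | 1996 |
| 5 | 16 | 2 | 136 | 0 | 5 | 2 | 20 | 2 | 20 | 2001 |
| 6 | 17 | 4 | 137 | 0 | 5 | 2 | 20 | 4 | 20 | 2002 |
| 7 | 18 | 6 | 138 | 0 | 5 | 2 | 20 | 6 | 20 | 2003 |
| 8 | 19 | 2 | 140 | 0 | 5 | 2 | 21 | 2 | 21 | 2004 |
| 9 | 20 | 4 | 140 | 0 | 5 | 2 | 20 | 4 | 20 | 2005 |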


In addition to extracting date features, you can use the subtraction operator on `Timestamp` objects to determine the amount of time between two different dates:

In [486]:
first = dates["month_day_year"].iloc[0]  # row labeled 1
third = dates["month_day_year"].iloc[2]  # row labeled 3
print(first)
print(third)
print(third - first)

1996-04-22 00:00:00
1996-05-14 00:00:00
22 days 00:00:00
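The result of subtracting two `Timestamp`s is a `Timedelta`, which exposes the gap in convenient units. The sketch below rebuilds the two dates printed above by hand, so it runs without `dates.csv`; the three-element `Series` is a made-up illustration of the same idea applied to a whole column.

```python
import pandas as pd

start = pd.Timestamp("1996-04-22")
end = pd.Timestamp("1996-05-14")

gap = end - start                    # a pandas Timedelta
whole_days = gap.days                # whole-day component as an int
seconds = gap.total_seconds()        # entire gap expressed in seconds

# Subtraction also works element-wise on a Series of Timestamps
stamps = pd.Series(pd.to_datetime(["1996-04-22", "1996-04-23", "1996-05-14"]))
days_since_first = (stamps - stamps.iloc[0]).dt.days

print(whole_days, seconds, days_since_first.tolist())
```

A numeric column such as `days_since_first` is often easier to feed into a model than the raw `Timedelta` values.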


Pandas includes a variety of more advanced date and time functionality beyond the basics covered in this lesson, particularly for dealing with time series data (data consisting of many periodic measurements over time). Read more about date and time functionality [here](http://pandas.pydata.org/pandas-docs/stable/timeseries.html).

## Wrap Up

Pandas makes it easy to convert date data into the `Timestamp` data format and extract basic date features like day of the year, month and day of week. Simple date features can be powerful predictors because data often exhibit cyclical patterns over different time scales.
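To make the point about cyclical patterns concrete, here is a small sketch with invented numbers: two weeks of hypothetical daily visit counts that are higher on weekends. Grouping by the day-of-week feature exposes the weekly cycle. The `visits` values and the choice of January 2024 are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical two weeks of daily counts with busier weekends (made-up values)
days = pd.date_range("2024-01-01", periods=14, freq="D")  # starts on a Monday
visits = pd.Series([10, 11, 12, 13, 14, 30, 31] * 2, index=days)

# Group by the day-of-week feature (0 = Monday, 6 = Sunday) and average
by_weekday = visits.groupby(visits.index.dayofweek).mean()

print(by_weekday)
```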

Cleaning and preprocessing numeric, character and date data is sometimes all you need to do before you start a project. In some cases, however, your data may be split across several tables, such as different worksheets in an Excel file or different tables in a database. In these cases, you might have to combine two tables before proceeding with your project. In the next notebook, we'll explore how to merge data sets.