<a href="https://colab.research.google.com/github/rahiakela/introduction-to-time-series-forecasting-with-python/blob/part-1-data-preparation/1_load_and_explore_time_series_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load and Explore Time Series Data

The Pandas library in Python provides excellent, built-in support for time series data. Once loaded, Pandas also provides tools to explore and better understand your dataset.

## Daily Female Births Dataset

We will use the Daily Female Births Dataset as an example. This dataset describes the number of daily female births in California in 1959. The values are counts, and there are 365 observations.

### Load Time Series Data

Pandas represents a univariate time series as a Series. A Series is a one-dimensional array with a time label for each row. We can load the Daily Female Births dataset directly into a Series using the read_csv() function.

In [0]:
import pandas as pd
import numpy as np
from pandas import Series

Note the arguments to the read_csv() function. We provide it a number of hints to ensure the data is loaded as a Series.

* **header=0**: We specify that the column names are on the first row (row 0) of the file.
* **parse_dates=[0]**: We give the function a hint that data in the first column contains dates that need to be parsed. This argument takes a list, so we provide it a list of one element, which is the index of the first column.
* **index_col=0**: We hint that the first column contains the index information for the time series.
* **squeeze=True**: We hint that we only have one data column and that we are interested in a Series and not a DataFrame. This argument was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, omit it and call .squeeze("columns") on the DataFrame that read_csv() returns.

In [7]:
# load the time series as a Series object, instead of a DataFrame
series = pd.read_csv('daily-total-female-births.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
series.head()

date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: births, dtype: int64

The series has a name, which is the column name of the data column. You can see that each row has an associated date. This is in fact not a column, but instead a time index for the values.

Because the dates form an index rather than a column, pandas does not require them to be unique or evenly spaced: there can be multiple values for one time, and observations may be spaced evenly or unevenly across time.
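The loading step above can also be sketched for pandas 2.0 and later, where read_csv() no longer accepts the squeeze argument. This is a minimal sketch under stated assumptions: the inline `csv_text` string is a stand-in for daily-total-female-births.csv containing only the first five rows shown in the output above, and the name `sample` is introduced here purely for illustration.

```python
from io import StringIO

import pandas as pd

# Assumption: inline stand-in for daily-total-female-births.csv,
# holding only the first five rows shown in the output above.
csv_text = """date,births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
"""

# Read a one-column DataFrame, then squeeze that single column into a Series.
frame = pd.read_csv(StringIO(csv_text), header=0, parse_dates=[0], index_col=0)
sample = frame.squeeze("columns")

print(type(sample).__name__)  # Series
print(sample.name)            # births
print(sample.head())
```

Passing "columns" to squeeze() collapses only the column axis, so the result should stay a Series even if the file happened to contain a single row.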

## Exploring Time Series Data

Pandas also provides tools to explore and summarize your time series data.

### Peek at the Data

It is a good idea to take a peek at your loaded data to confirm that the types, dates, and values loaded as you intended. You can use the head() function to peek at the first 5 records, or pass a number n to review the first n records.

In [8]:
# summarize first few lines of a file
series.head(10)

date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
Name: births, dtype: int64

You can also use the tail() function to get the last n records of the dataset.

In [9]:
series.tail(10)

date
1959-12-22    39
1959-12-23    40
1959-12-24    38
1959-12-25    44
1959-12-26    34
1959-12-27    37
1959-12-28    52
1959-12-29    48
1959-12-30    55
1959-12-31    50
Name: births, dtype: int64
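Beyond reading the printed rows, the same type and date checks can be made programmatically. The following is a minimal sketch, assuming a small hand-built series (`sample`, a hypothetical name) that reuses the first five values shown above in place of the full dataset.

```python
import pandas as pd

# Assumption: hand-built stand-in for the loaded series (first five rows only).
dates = pd.to_datetime(
    ["1959-01-01", "1959-01-02", "1959-01-03", "1959-01-04", "1959-01-05"]
)
sample = pd.Series([35, 32, 30, 31, 44], index=dates, name="births")
sample.index.name = "date"

# The values should be integers, not strings left over from a parsing problem.
values_are_integers = pd.api.types.is_integer_dtype(sample)

# The index should hold real timestamps, in order, with no repeated dates.
index_is_datetime = isinstance(sample.index, pd.DatetimeIndex)
index_is_sorted = sample.index.is_monotonic_increasing
index_is_unique = sample.index.is_unique

print(values_are_integers, index_is_datetime, index_is_sorted, index_is_unique)
# True True True True
```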

### Number of Observations

Another quick check to perform on your data is the number of loaded observations. This can help flush out issues with column headers not being handled as intended, and gives you an idea of how to divide up the data later for use with supervised learning algorithms.

You can get the number of observations in your Series using the size attribute.

In [10]:
# summarize the dimensions of a time series
series.size

365
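One way to use the observation count as a sanity check is to compare it against the number of calendar days the index should cover. The sketch below is illustrative only: it assumes a hypothetical four-row series (`sample`) from which 1959-01-04 has been deliberately left out, so that the check has something to find.

```python
import pandas as pd

# Assumption: illustrative series with 1959-01-04 deliberately omitted.
dates = pd.to_datetime(["1959-01-01", "1959-01-02", "1959-01-03", "1959-01-05"])
sample = pd.Series([35, 32, 30, 44], index=dates, name="births")

# Every calendar day between the first and last timestamp in the index.
expected = pd.date_range(sample.index.min(), sample.index.max(), freq="D")

# Days that should be present but are not.
missing = expected.difference(sample.index)

print(sample.size)                         # 4
print(len(expected))                       # 5
print(list(missing.strftime("%Y-%m-%d")))  # ['1959-01-04']
```

For the full dataset, a size of 365 matches the number of days in 1959, which suggests no days were dropped during loading.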

### Querying By Time

You can slice, dice, and query your series using the time index.

This type of index-based querying can help to prepare summary statistics and plots while exploring the dataset.

In [11]:
# access all observations in January
series['1959-01']

date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
1959-01-11    38
1959-01-12    33
1959-01-13    55
1959-01-14    47
1959-01-15    45
1959-01-16    37
1959-01-17    50
1959-01-18    43
1959-01-19    41
1959-01-20    52
1959-01-21    34
1959-01-22    53
1959-01-23    39
1959-01-24    32
1959-01-25    37
1959-01-26    43
1959-01-27    39
1959-01-28    35
1959-01-29    44
1959-01-30    38
1959-01-31    24
Name: births, dtype: int64
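The same kind of query can be written with the .loc indexer, which also accepts date ranges. This is a minimal sketch, assuming a synthetic series (`sample`) covering January and February 1959 whose values are simply 0, 1, 2, ... rather than the real birth counts.

```python
import pandas as pd

# Assumption: synthetic daily series for Jan-Feb 1959 (values are 0, 1, 2, ...).
dates = pd.date_range("1959-01-01", "1959-02-28", freq="D")
sample = pd.Series(range(len(dates)), index=dates, name="births")

# Partial-string selection: every observation in January.
january = sample.loc["1959-01"]

# Label-based slice: both endpoints are included.
first_week = sample.loc["1959-01-01":"1959-01-07"]

print(len(january))     # 31
print(len(first_week))  # 7
print(january.mean())   # 15.0
```

Selections like these return ordinary Series objects, so summary methods such as mean() can be applied to a single month or week directly.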

### Descriptive Statistics

Calculating descriptive statistics on your time series can help get an idea of the distribution and spread of values. 

This may help with ideas of data scaling and even data cleaning that you can perform later as part of preparing your dataset for modeling. 

The describe() function creates an eight-number summary of the loaded time series: the count, mean, standard deviation, minimum, 25th, 50th (median), and 75th percentiles, and maximum of the observations.

In [12]:
# calculate descriptive statistics
series.describe()

count    365.000000
mean      41.980822
std        7.348257
min       23.000000
25%       37.000000
50%       42.000000
75%       46.000000
max       73.000000
Name: births, dtype: float64
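Each entry in the describe() output corresponds to an individual Series method, which can be handy when only one statistic is needed. The sketch below assumes a ten-value series (`sample`, a hypothetical name) built from the first ten observations printed earlier, so its numbers differ from the full-year summary above.

```python
import pandas as pd

# Assumption: first ten observations only, as printed by series.head(10).
values = [35, 32, 30, 31, 44, 29, 45, 43, 38, 27]
dates = pd.date_range("1959-01-01", periods=10, freq="D")
sample = pd.Series(values, index=dates, name="births")

summary = sample.describe()

# The 50% row is the median; std is the sample standard deviation (ddof=1).
print(summary["mean"], sample.mean())     # 35.4 35.4
print(summary["50%"], sample.median())    # 33.5 33.5
print(summary["std"] == sample.std())     # True

# Other percentiles can be requested explicitly.
print(sample.describe(percentiles=[0.05, 0.95]))
```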