# Managing Time Series Data - Lab

## Introduction

In the previous lesson, you learned that time series data are everywhere and that working with them is an important skill for data scientists.

In this lab, you'll practice the techniques you learned for importing, cleaning, and manipulating time series data.

The lab will cover how to perform time series analysis while working with large datasets. The dataset can be memory intensive so your computer will need at least 2GB of memory to perform some of the calculations.


## Objectives

You will be able to:

- Load time series data using Pandas and perform time series indexing 
- Perform data cleaning operations on time series data 
- Change the granularity of a time series 


## Let's get started!

Import the following libraries: 

* `pandas`, using the alias `pd` 
* `pandas.tseries` 
* `matplotlib.pyplot`, using the alias `plt` 
* `statsmodels.api`, using the alias `sm`

In [18]:
# Load required libraries
import numpy as np
import pandas as pd
import pandas.tseries
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm

## Loading time series data
The `statsmodels` library comes bundled with built-in datasets for experimentation and practice. A detailed description of these datasets can be found [here](http://www.statsmodels.org/dev/datasets/index.html). Using `statsmodels`, the time series datasets can be loaded straight into memory. 

In this lab, we'll use the **Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.**, containing CO2 samples from March 1958 to December 2001. Further details on this dataset are available [here](http://www.statsmodels.org/dev/datasets/generated/co2.html).

The following cells: 

- Load the `co2` dataset using the `.load()` method 
- Convert it into a pandas DataFrame 
- Rename the columns 
- Set the `'date'` column as the index 

In [19]:
# Load the 'co2' dataset from sm.datasets
data_set = sm.datasets.co2.load()

# Load the data into a pandas DataFrame
CO2 = pd.DataFrame(data=data_set['data'])

# Rename columns
CO2.rename(columns={'index': 'date'}, inplace=True)

In [20]:
CO2.head()

          date    co2
0  b'19580329'  316.1
1  b'19580405'  317.3
2  b'19580412'  317.6
3  b'19580419'  317.5
4  b'19580426'  316.4


In [21]:
CO2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2284 entries, 0 to 2283
Data columns (total 2 columns):
date    2284 non-null object
co2     2225 non-null float64
dtypes: float64(1), object(1)
memory usage: 35.8+ KB


In [22]:
# Decode the byte strings in the "date" column and build a weekly DatetimeIndex
index = pd.DatetimeIndex(data=CO2["date"].str.decode("utf-8"), freq='W-SAT')
# Set the created index as the index of the DataFrame
CO2.set_index(index, inplace=True)
CO2.head()

                   date    co2
date
1958-03-29  b'19580329'  316.1
1958-04-05  b'19580405'  317.3
1958-04-12  b'19580412'  317.6
1958-04-19  b'19580419'  317.5
1958-04-26  b'19580426'  316.4


In [23]:
CO2 = CO2.drop(columns='date')
CO2.head()

              co2
date
1958-03-29  316.1
1958-04-05  317.3
1958-04-12  317.6
1958-04-19  317.5
1958-04-26  316.4
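
The decode-and-index steps above can also be tried on a tiny hand-made frame. The sketch below is one possible approach, not the lab's exact code: the names `raw`, `parsed`, and `demo` are made up for illustration, the four rows are copied from the output shown earlier, and `pd.to_datetime` with an explicit `format` is used instead of passing the strings straight to `pd.DatetimeIndex`.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the raw data
# (values copied from the output above)
raw = pd.DataFrame({
    "date": [b"19580329", b"19580405", b"19580412", b"19580419"],
    "co2": [316.1, 317.3, 317.6, 317.5],
})

# Decode the byte strings, then parse the YYYYMMDD text into timestamps
parsed = pd.to_datetime(raw["date"].str.decode("utf-8"), format="%Y%m%d")

# Drop the raw column and use the parsed dates as the index
demo = raw.drop(columns="date")
demo.index = pd.DatetimeIndex(parsed, name="date")

# Conform the index to a weekly, Saturday-anchored frequency
demo = demo.asfreq("W-SAT")

print(demo.index)
```

Stating the format explicitly avoids relying on pandas to guess how an eight-digit string should be read.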


Let's check the summary information of `CO2` and also print the first 10 rows of `CO2` as our first exploratory step.

In [24]:
# Print summary information (index type, column dtypes, non-null counts) for CO2
print(CO2.info())

# Print the first 10 rows of CO2
print(CO2.head(10))

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2284 entries, 1958-03-29 to 2001-12-29
Freq: W-SAT
Data columns (total 1 columns):
co2    2225 non-null float64
dtypes: float64(1)
memory usage: 35.7 KB
None
              co2
date             
1958-03-29  316.1
1958-04-05  317.3
1958-04-12  317.6
1958-04-19  317.5
1958-04-26  316.4
1958-05-03  316.9
1958-05-10    NaN
1958-05-17  317.5
1958-05-24  317.9
1958-05-31    NaN


With all the required packages imported and the `CO2` dataset ready to go as a DataFrame, we can move on to indexing our data.

## Date Indexing

While working with time series data in Python, having dates (or datetimes) in the index can be very helpful, especially if they are of `DatetimeIndex` type. Further details can be found [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timestamp.html).

Print the `.index` attribute of the `CO2` DataFrame: 

In [25]:
# Confirm that date values are used for indexing purpose in the CO2 dataset 
CO2.index

DatetimeIndex(['1958-03-29', '1958-04-05', '1958-04-12', '1958-04-19',
               '1958-04-26', '1958-05-03', '1958-05-10', '1958-05-17',
               '1958-05-24', '1958-05-31',
               ...
               '2001-10-27', '2001-11-03', '2001-11-10', '2001-11-17',
               '2001-11-24', '2001-12-01', '2001-12-08', '2001-12-15',
               '2001-12-22', '2001-12-29'],
              dtype='datetime64[ns]', name='date', length=2284, freq='W-SAT')

The output above confirms that the dataset is indexed by dates. Look at the last line:


> **dtype='datetime64[ns]', name='date', length=2284, freq='W-SAT'**


* `dtype='datetime64[ns]'` confirms that the index is made of timestamp objects.
* `length=2284` shows the total number of entries in our time series data. 
* `freq='W-SAT'` shows that the entries are spaced one week apart, anchored on Saturdays.
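
The same fields can be read directly from any `DatetimeIndex` as attributes. The following is a minimal sketch on a hypothetical five-entry index (the name `idx` is an assumption for illustration), built with `pd.date_range` from the lab's first date:

```python
import pandas as pd

# Hypothetical five-entry weekly index starting on the lab's first date
idx = pd.date_range(start="1958-03-29", periods=5, freq="W-SAT")

print(idx.dtype)       # element type of the index
print(len(idx))        # number of entries
print(idx.freqstr)     # frequency alias attached to the index
print(idx.day_name())  # weekday name of each timestamp
```

Printing the weekday names is a quick way to check that a weekly frequency is anchored on the day you expect.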

## Resampling

Remember that, depending on the nature of the analytical question, the resolution of the timestamps can be changed to other frequencies. For this dataset, we can resample the weekly readings to monthly average CO2 values. This can be done with the `.resample()` method, as seen in the earlier lesson. 

* Group the data into buckets representing 1 month using the `.resample()` method 
* Call the `.mean()` method on each group (i.e. get monthly average) 
* Combine the result as one row per monthly group 

In [27]:
# Group the time series into monthly buckets
CO2_monthly = CO2['co2'].resample('MS')

# Take the mean of each group 
CO2_monthly_mean = CO2_monthly.mean()

# Get the first 10 elements of resulting time series
CO2_monthly_mean.head(10)

date
1958-03-01    316.100000
1958-04-01    317.200000
1958-05-01    317.433333
1958-06-01           NaN
1958-07-01    315.625000
1958-08-01    314.950000
1958-09-01    313.500000
1958-10-01           NaN
1958-11-01    313.425000
1958-12-01    314.700000
Freq: MS, Name: co2, dtype: float64

Looking at the index values, we can see that our time series now carries monthly aggregates: `Freq: MS` stands for month-start frequency, so each average is labeled with the first day of its month. 
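
To see how the monthly averages are formed, the resampling can be repeated on only the first ten weekly readings printed earlier in this lab. This is a small sketch: the names `weekly` and `monthly` are made up for illustration, and the index is rebuilt with `pd.date_range` instead of being loaded from the dataset.

```python
import numpy as np
import pandas as pd

# First ten weekly readings, copied from the output earlier in this lab
weekly = pd.Series(
    [316.1, 317.3, 317.6, 317.5, 316.4, 316.9, np.nan, 317.5, 317.9, np.nan],
    index=pd.date_range("1958-03-29", periods=10, freq="W-SAT"),
    name="co2",
)

# Bucket by calendar month; each bucket is labeled with the month's first day.
# mean() skips NaN readings, so a month is NaN only if all its readings are missing.
monthly = weekly.resample("MS").mean()

print(monthly)
```

The April bucket averages four readings, while the May bucket averages only its three non-missing ones, which is why a month with some missing weeks still gets a value.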

### Time-series Index Slicing for Data Selection

Slice the dataset to retrieve only the data points from 1990 onwards.

In [28]:
# Slice the time series to keep data from 1990 onwards
temp_1990_onwards = CO2['1990':]
print(temp_1990_onwards.head())
print(temp_1990_onwards.tail())

              co2
date             
1990-01-06  353.4
1990-01-13  353.5
1990-01-20  353.8
1990-01-27  353.9
1990-02-03  354.1
              co2
date             
2001-12-01  370.3
2001-12-08  370.8
2001-12-15  371.2
2001-12-22  371.3
2001-12-29  371.5


Retrieve the data from January 1, 1990 through January 1, 1991: 

In [29]:
# Retrieve the data from 1 Jan 1990 through 1 Jan 1991 (both endpoints are inclusive)
temp_1990_1991 = CO2['1990-01-01':'1991-01-01']
print(temp_1990_1991.head())
print(temp_1990_1991.tail())

              co2
date             
1990-01-06  353.4
1990-01-13  353.5
1990-01-20  353.8
1990-01-27  353.9
1990-02-03  354.1
              co2
date             
1990-12-01  353.6
1990-12-08  354.0
1990-12-15  353.8
1990-12-22  354.5
1990-12-29  354.8
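
The slicing behavior used above can be checked on a small synthetic series. The sketch below assumes a hypothetical weekly series (`series`, with made-up values) and uses `.loc`, which makes the label-based selection explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series spanning late 1989 to early 1991
idx = pd.date_range("1989-12-02", "1991-02-02", freq="W-SAT")
series = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# A bare year as the start label selects from the beginning of that year onwards
from_1990 = series.loc["1990":]

# Both endpoints of a label-based slice are inclusive
year_1990 = series.loc["1990-01-01":"1991-01-01"]

print(from_1990.index[0], year_1990.index[-1], len(year_1990))
```

Because the data are weekly, the first and last rows returned are the nearest Saturdays inside the requested range, not the boundary dates themselves.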


## Missing Values

Find the total number of missing values in the monthly series `CO2_monthly_mean`.

In [30]:
# Find the total number of missing values in the time series
CO2_monthly_mean.isnull().sum()

5

Remember that missing values can be filled in a multitude of ways. 

- Replace the missing values in `CO2_monthly_mean` with the previous valid value (forward fill) 
- Next, check whether your attempt was successful by counting the missing values again 

In [31]:
# Perform forward filling of missing values
CO2_final = CO2_monthly_mean.ffill()

# Find the total number of missing values in the time series
CO2_final.isnull().sum()

0

Great! Now your time series data are ready for visualization and further analysis.
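
Forward filling is only one of the ways to fill gaps. As a rough comparison, the sketch below applies three common pandas options to a hypothetical monthly series with made-up values (the names `s`, `forward`, `backward`, and `linear` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with gaps
s = pd.Series(
    [1.0, np.nan, 3.0, np.nan, np.nan, 6.0],
    index=pd.date_range("2000-01-01", periods=6, freq="MS"),
)

forward = s.ffill()       # carry the previous valid value forward
backward = s.bfill()      # pull the next valid value backward
linear = s.interpolate()  # default linear method treats rows as equally spaced

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "linear": linear}))
```

Which option is appropriate depends on the data: forward filling never uses future information, whereas backward filling and interpolation do.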

## Summary

In this introductory lab, you learned how to create a time series object in Python using pandas, how to use timestamp values as the data index, and how to apply basic data handling techniques (resampling, slicing, and filling missing values) to prepare time series data for further analysis.