# Managing Time Series Data - Lab

## Introduction

In the previous lesson, you learned that time series data are everywhere and working with time series data is an important skill for data scientists!

In this lab, you'll practice your previously learned techniques to import, clean, and manipulate time series data.

The lab will cover how to perform time series analysis while working with large datasets. The dataset can be memory intensive so your computer will need at least 2GB of memory to perform some of the calculations.


## Objectives

You will be able to:

- Load time series data using Pandas and perform time series indexing 
- Perform data cleaning operations on time series data 
- Change the granularity of a time series 


## Let's get started!

Import the following libraries: 

* `pandas`, using the alias `pd` 
* `pandas.tseries` 
* `matplotlib.pyplot`, using the alias `plt` 
* `statsmodels.api`, using the alias `sm`

In [2]:
# Load required libraries
import pandas as pd
import pandas.tseries
import matplotlib.pyplot as plt
import statsmodels.api as sm

## Loading time series data
The `statsmodels` library comes bundled with built-in datasets for experimentation and practice. A detailed description of these datasets can be found [here](http://www.statsmodels.org/dev/datasets/index.html). Using `statsmodels`, the time series datasets can be loaded straight into memory. 

In this lab, we'll use the **Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.**, containing CO2 samples from March 1958 to December 2001. Further details on this dataset are available [here](http://www.statsmodels.org/dev/datasets/generated/co2.html).

The following cell: 

- Loads the `co2` dataset using the `.load()` method 
- Converts it into a pandas DataFrame 
- Renames the `'index'` column to `'date'` 
- Sets the `'date'` column as the index 

In [3]:
# Load the 'co2' dataset from sm.datasets
data_set = sm.datasets.co2.load()

# Load the dataset into a pandas DataFrame
CO2 = pd.DataFrame(data=data_set['data'])
CO2.rename(columns={'index': 'date'}, inplace=True)

# set index to date column
CO2.set_index('date', inplace=True)

CO2.head()



| date | co2 |
|---|---|
| 1958-03-29 | 316.1 |
| 1958-04-05 | 317.3 |
| 1958-04-12 | 317.6 |
| 1958-04-19 | 317.5 |
| 1958-04-26 | 316.4 |
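The load, rename, and set-index steps above can be tried on a tiny hand-made table without downloading anything. This is only a sketch: the `records` dictionary and the name `toy` are hypothetical, and the values are invented for illustration rather than taken from the Mauna Loa data.

```python
import pandas as pd

# Hypothetical toy records; these values are invented for illustration
records = {
    "index": ["2020-01-04", "2020-01-11", "2020-01-18"],
    "co2": [413.1, 413.4, None],
}

toy = pd.DataFrame(records)

# Parse the date strings so the index becomes a DatetimeIndex, not plain text
toy["index"] = pd.to_datetime(toy["index"])

# Same two steps as in the lab cell: rename the column, then use it as the index
toy = toy.rename(columns={"index": "date"}).set_index("date")

print(type(toy.index).__name__)
print(toy)
```

Parsing the strings with `pd.to_datetime()` before calling `set_index()` is what makes date-based slicing and resampling possible later on.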


Let's check the data type of `CO2` and also print the first 15 entries of `CO2` as our first exploratory step.

In [4]:
# Print the data type of CO2 along with a summary of its index and columns
CO2.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2284 entries, 1958-03-29 to 2001-12-29
Data columns (total 1 columns):
co2    2225 non-null float64
dtypes: float64(1)
memory usage: 35.7 KB


In [5]:
# Print the first 15 rows of CO2
CO2.head(15)

| date | co2 |
|---|---|
| 1958-03-29 | 316.1 |
| 1958-04-05 | 317.3 |
| 1958-04-12 | 317.6 |
| 1958-04-19 | 317.5 |
| 1958-04-26 | 316.4 |
| 1958-05-03 | 316.9 |
| 1958-05-10 | NaN |
| 1958-05-17 | 317.5 |
| 1958-05-24 | 317.9 |
| 1958-05-31 | NaN |


With all the required packages imported and the `CO2` dataset ready to go as a DataFrame, we can move on to indexing our data.

## Date Indexing

While working with time series data in Python, having dates (or datetimes) in the index can be very helpful, especially if they are of `DatetimeIndex` type. Further details can be found [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timestamp.html).

Print the `.index` attribute of the `CO2` DataFrame: 

In [6]:
# Confirm that date values are used for indexing purpose in the CO2 dataset 
CO2.index

DatetimeIndex(['1958-03-29', '1958-04-05', '1958-04-12', '1958-04-19',
               '1958-04-26', '1958-05-03', '1958-05-10', '1958-05-17',
               '1958-05-24', '1958-05-31',
               ...
               '2001-10-27', '2001-11-03', '2001-11-10', '2001-11-17',
               '2001-11-24', '2001-12-01', '2001-12-08', '2001-12-15',
               '2001-12-22', '2001-12-29'],
              dtype='datetime64[ns]', name='date', length=2284, freq=None)

The output above shows that our dataset fulfills the indexing requirements. Look at the last line:


> **dtype='datetime64[ns]', name='date', length=2284, freq=None**


* The `dtype='datetime64[ns]'` field confirms that the index is made of timestamp objects.
* `length=2284` shows the total number of entries in our time series data. 
* `freq=None` shows that no frequency is attached to the index, even though the observations are spaced one week apart.
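When an index reports `freq=None`, pandas can often still work out the spacing from the timestamps themselves. The sketch below rebuilds the first five observations shown earlier by hand (the names `dates`, `series`, and `weekly` are hypothetical), infers their frequency with `pd.infer_freq()`, and attaches it with `.asfreq()`.

```python
import pandas as pd

# First five dates and values from the CO2 output above, typed in by hand
dates = pd.to_datetime(
    ["1958-03-29", "1958-04-05", "1958-04-12", "1958-04-19", "1958-04-26"]
)
series = pd.Series([316.1, 317.3, 317.6, 317.5, 316.4], index=dates)

# An index built from a list of dates carries no frequency
print(series.index.freq)  # None

# Infer the spacing from the timestamps (weekly, anchored on Saturday)
inferred = pd.infer_freq(series.index)
print(inferred)  # W-SAT

# Attach the inferred frequency to the index
weekly = series.asfreq(inferred)
print(weekly.index.freqstr)
```

`pd.infer_freq()` needs at least three timestamps and returns `None` when the spacing is irregular, so treat the result as a hint rather than a guarantee.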

## Resampling

Remember that, depending on the nature of the analytical question, the resolution of the timestamps can be changed to another frequency. For this dataset, we can resample the weekly observations to monthly average CO2 concentrations. This can be done by using the `.resample()` method as seen in the earlier lesson. 

* Group the data into buckets representing 1 month using `.resample()` method 
* Call the `.mean()` method on each group (i.e. get monthly average) 
* Combine the result as one row per monthly group 

In [8]:
# Group the time series into monthly buckets
CO2_monthly = CO2.resample('MS')

# Take the mean of each group 
CO2_monthly_mean = CO2_monthly.mean()

# Get the first 10 elements of resulting time series
CO2_monthly_mean.head(10)

| date | co2 |
|---|---|
| 1958-03-01 | 316.1 |
| 1958-04-01 | 317.2 |
| 1958-05-01 | 317.433333 |
| 1958-06-01 | NaN |
| 1958-07-01 | 315.625 |
| 1958-08-01 | 314.95 |
| 1958-09-01 | 313.5 |
| 1958-10-01 | NaN |
| 1958-11-01 | 313.425 |
| 1958-12-01 | 314.7 |


In [9]:
CO2_monthly_mean.index

DatetimeIndex(['1958-03-01', '1958-04-01', '1958-05-01', '1958-06-01',
               '1958-07-01', '1958-08-01', '1958-09-01', '1958-10-01',
               '1958-11-01', '1958-12-01',
               ...
               '2001-03-01', '2001-04-01', '2001-05-01', '2001-06-01',
               '2001-07-01', '2001-08-01', '2001-09-01', '2001-10-01',
               '2001-11-01', '2001-12-01'],
              dtype='datetime64[ns]', name='date', length=526, freq='MS')

Looking at the index values, we can see that our time series now carries data aggregated at a monthly frequency, shown as `freq='MS'` (month start). 
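To see how the monthly averages come about, the following sketch rebuilds the first ten weekly observations shown earlier and resamples them by hand. The names `weekly` and `monthly` are hypothetical; the values are copied from the `CO2.head(15)` output. Reporting the `count` next to the `mean` shows how many valid readings went into each bucket.

```python
import numpy as np
import pandas as pd

# First ten weekly readings from the output above (NaN marks a missing sample)
dates = pd.date_range("1958-03-29", periods=10, freq="W-SAT")
values = [316.1, 317.3, 317.6, 317.5, 316.4, 316.9, np.nan, 317.5, 317.9, np.nan]
weekly = pd.Series(values, index=dates, name="co2")

# Bucket by calendar month (labelled with the month start) and aggregate;
# both mean and count skip NaN values
monthly = weekly.resample("MS").agg(["mean", "count"])

print(monthly)
```

Because `.mean()` ignores missing values, a month only comes out as `NaN` when every reading in that month is missing.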

### Time-series Index Slicing for Data Selection

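Slice our dataset to retrieve only the data points from 1990 onward. Note that a bare integer slice such as `CO2[1990:]` is positional: it starts at row number 1990, not at the year 1990. Pass the year as a string to `.loc` instead.

In [10]:
# Slice the time series to contain data from 1990 onward
CO2.loc['1990':]

| date | co2 |
|---|---|
| 1990-01-06 | 353.4 |
| 1990-01-13 | 353.5 |
| 1990-01-20 | 353.8 |
| 1990-01-27 | 353.9 |
| 1990-02-03 | 354.1 |
| ... | ... |
| 2001-12-01 | 370.3 |
| 2001-12-08 | 370.8 |
| 2001-12-15 | 371.2 |
| 2001-12-22 | 371.3 |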


Retrieve data starting from Jan 1990 to Jan 1991: 

In [16]:
# Retrieve the data between 1st Jan 1990 and 1st Jan 1991 (both endpoints inclusive)
CO2['1990-01-01':'1991-01-01']

| date | co2 |
|---|---|
| 1990-01-06 | 353.4 |
| 1990-01-13 | 353.5 |
| 1990-01-20 | 353.8 |
| 1990-01-27 | 353.9 |
| 1990-02-03 | 354.1 |
| 1990-02-10 | 355.0 |
| 1990-02-17 | 354.8 |
| 1990-02-24 | 354.7 |
| 1990-03-03 | 355.7 |
| 1990-03-10 | 354.9 |
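Date strings and integers behave very differently when slicing a date-indexed DataFrame, which is easy to get wrong. The sketch below uses a small hypothetical frame (`toy`, with made-up values) that straddles the turn of the year 1990 to contrast label-based and positional selection.

```python
import pandas as pd

# Eight weekly dates from 1989-12-02 to 1990-01-20; values are placeholders
dates = pd.date_range("1989-12-02", periods=8, freq="W-SAT")
toy = pd.DataFrame({"value": range(8)}, index=dates)

# Label-based: a year string selects every row dated 1990-01-01 or later
by_label = toy.loc["1990":]

# Positional: integers count rows, regardless of what the dates are
by_position = toy.iloc[3:]

# Partial strings also work at month resolution
by_month = toy.loc["1989-12"]

print(len(by_label), len(by_position), len(by_month))  # 3 5 5
```

Using `.loc` for dates and `.iloc` for row positions makes the intent explicit and avoids silently selecting the wrong rows.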


## Missing Values

Find the total number of missing values in the dataset.

In [17]:
# Find the total number of missing values in the time series
CO2.isna().sum()

co2    59
dtype: int64

Remember that missing values can be filled in a multitude of ways. 

- Replace the missing values in `CO2` with the previous valid value (forward fill) 
- Next, check if your attempt was successful by checking for number of missing values again 

In [19]:
# Perform forward filling of missing values
CO2_final = CO2.ffill()

# Find the total number of missing values in the time series
CO2_final.isna().sum()

co2    0
dtype: int64
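Forward filling is only one of the available options. As a rough comparison, the sketch below applies three common strategies to the five May 1958 readings shown earlier; the names `s` and `filled` are hypothetical, and which strategy suits a given analysis is a judgment call.

```python
import numpy as np
import pandas as pd

# May 1958 readings from the output above, including two missing samples
dates = pd.date_range("1958-05-03", periods=5, freq="W-SAT")
s = pd.Series([316.9, np.nan, 317.5, 317.9, np.nan], index=dates)

filled = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),              # carry the last valid value forward
    "bfill": s.bfill(),              # pull the next valid value backward
    "interpolate": s.interpolate(),  # linear interpolation between neighbours
})

print(filled)
print(filled.isna().sum())
```

Backward filling leaves the final value missing here because there is no later observation to copy from, which is one reason to check `.isna().sum()` after any fill.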

Great! Now your time series data are ready for visualization and further analysis.

## Summary

In this introductory lab, you learned how to create a time series object in Python using Pandas. You learned how to check timestamp values as the data index and you learned about basic data handling techniques for time-series data for further analysis.