# Using Machine Learning to Forecast Time Series (Tutorial Series)

## Intro to the series:

This is a six-part series based on the "Time Series Forecasting With Python Mini-Course" by Jason Brownlee at [MachineLearningMastery.com](https://machinelearningmastery.com/category/time-series/). We will start by exploring the basics of manipulating and visualising time series data, before moving on to persistence forecasting models, and then finally autoregressive (AR) and ARIMA models.

**Lesson 1:** Import and Explore Data

**Lesson 2:** Visualising Time Series

**Lesson 3:** Persistence Forecast Model

**Lesson 4:** Autoregressive Forecast Model

**Lesson 5:** ARIMA Forecast Model

**Lesson 6:** End to End Project

## Lesson 1: Import and Explore Data

### Lesson objective: 

#### *Practice loading and exploring time series data in Python:*

Before we can develop models for forecasting time series, we must load our data and look at its structure. We will also generate some simple descriptive statistics to get an idea of the distribution of data points.

The Pandas library offers excellent functionality for loading data from .csv files.

In this lesson, we will load a standard time series dataset with pandas and explore it. To follow along properly, you should have some knowledge of the pandas package, as well as basic knowledge of Python.

This lesson uses a time series dataset of Female Births in California 1959 that was sourced from [datamarket.com](https://datamarket.com).

*In the next lesson, we will cover using plotting libraries to visualize our time series data.*

**1. Import the data using the pandas.read_csv() function.**

*First, import the necessary modules:*

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os # Miscellaneous operating system interfaces

*Now we can use the pandas read_csv function. We will specify arguments to set the column names and to mark which row of the csv holds the header (row 0, which our own names replace). We will also tell pandas to parse the 'date' column as datetime64 and make it the index of the dataframe (this will make slicing our data easier later on).*

In [None]:
fpath = '../input/daily-total-female-births-in-california-1959/daily-total-female-births-CA.csv'
df = pd.read_csv(fpath, names=['date', 'births'], header=0, parse_dates=['date'], index_col='date')
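
To see what each `read_csv` argument does without needing the dataset file, the same call can be tried on a small in-memory CSV. This is only a sketch: `csv_text` and `sample` are hypothetical names, and the three rows are made-up placeholder values, not a quote of the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
csv_text = '''"Date","Births"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
'''

sample = pd.read_csv(
    io.StringIO(csv_text),      # read from a string instead of a file path
    names=['date', 'births'],   # our own column names
    header=0,                   # row 0 is the file's header, so it is replaced by `names`
    parse_dates=['date'],       # convert the 'date' strings to datetimes
    index_col='date',           # use the parsed dates as the index
)

print(sample.index)   # a DatetimeIndex rather than plain strings
print(sample.dtypes)  # only 'births' remains as a column
```

If `header=0` were left out, pandas would treat the original header line as a row of data, which is a common source of confusing dtypes.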

**2. Check the Dataframe structure and values**

**a)	Print the first few rows using the head() method.**

*Print the first 5 rows of the dataframe using the [.head()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method:
Note that .head() defaults to the first 5 rows.*

In [None]:
df.head()

Using head allows us to verify that the dataframe and .csv have the same values for the first few rows. We can do the same for the last rows using [.tail()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html). This time, let's try specifying a parameter for how many rows to display:

In [None]:
df.tail(10)

**2. b) 	 Look at the structure of the dataframe using the info() method.**

In [None]:
df.info()

It's important to check our data is imported properly, with no missing or unexpected values and the correct data types for each column. Using the [info()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method we can see that the dataframe has 365 entries between 1959-01-01 and 1959-12-31, and that there are no null values in the births column. We can also use the [.size](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.size.html) attribute of the dataframe to check the number of elements in the dataframe. 

Another way to check that there are no missing values in our dataframe is to use the [.count()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) method, which returns the number of non-null values in each column. If that number matches the number of rows, nothing is missing:

In [None]:
print('There are {} elements in the DataFrame'.format(df.size))
df.count()
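
The births data has no gaps, so it may help to see how these checks behave when a value is missing. The sketch below uses a hypothetical five-day frame called `toy` with one deliberately missing observation; `isna().sum()` is an extra check not used in the lesson, included here as one possible way to count the gaps directly.

```python
import numpy as np
import pandas as pd

# Hypothetical five-day series with one missing observation
toy = pd.DataFrame(
    {'births': [35, 32, np.nan, 31, 44]},
    index=pd.date_range('1959-01-01', periods=5, freq='D'),
)

n_elements = toy.size         # every cell, including the missing one
non_null = toy.count()        # non-null cells per column
n_missing = toy.isna().sum()  # missing cells per column

print(n_elements)
print(non_null)
print(n_missing)
```

Here `.size` and `.count()` disagree, which is the signal that something is missing.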

**4.	Query the dataset using date-time strings:**

**a) Select all dates from a single month.**

In [None]:
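# Select all dates from December 1959 (month 12).
# Use .loc here: selecting rows with df['1959-12'] is no longer supported in recent pandas versions.
df.loc['1959-12']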
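
Selecting rows with a partial date string can also be tried on synthetic data. The following is a minimal sketch assuming a hypothetical frame `toy` with one row per day of 1959; the values are just a running counter.

```python
import pandas as pd

# Hypothetical stand-in for the births data: one value per day of 1959
days = pd.date_range('1959-01-01', '1959-12-31', freq='D')
toy = pd.DataFrame({'births': range(len(days))}, index=days)

# A year-month string selects every row that falls in that month
december = toy.loc['1959-12']

# Partial strings also work as slice endpoints (both ends are included)
last_quarter = toy.loc['1959-10':'1959-12']

print(len(december), len(last_quarter))
```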

**4. b) Select all records between the two dates (inclusive).**

In [None]:
# Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:
start_date = '1959-03-23'
end_date = '1959-04-02'
mask = (df.index >= start_date) & (df.index <= end_date)
df[mask]

# Alternate approaches using .loc method
# df.loc['1959-03-23':'1959-04-02']
# df.loc[start_date:end_date]

N.B. If we wanted to use an exclusive start or end date, we would simply remove the equals sign from the comparison (e.g. '>=' becomes '>').
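
The difference between inclusive and exclusive bounds is easy to check by counting rows. This sketch again assumes a hypothetical daily frame `toy` covering 1959; the variable names `inclusive`, `exclusive` and `by_label` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the births data: one value per day of 1959
days = pd.date_range('1959-01-01', '1959-12-31', freq='D')
toy = pd.DataFrame({'births': range(len(days))}, index=days)

start_date = '1959-03-23'
end_date = '1959-04-02'

# Both end points kept
inclusive = toy[(toy.index >= start_date) & (toy.index <= end_date)]

# Both end points dropped
exclusive = toy[(toy.index > start_date) & (toy.index < end_date)]

# Label-based slicing with .loc keeps both end points as well
by_label = toy.loc[start_date:end_date]

print(len(inclusive), len(exclusive), len(by_label))
```

Note that `.loc` slicing has no exclusive form, so the boolean mask is the more flexible of the two approaches.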

**4. c) Select a single date**

In [None]:
df.loc['1959-01-24']

We can also use the same method to select all rows of data from a single day when we have a time series with a higher frequency (e.g. in hours, minutes, or seconds).

In [None]:
# Create data frame
hourly = pd.DataFrame()

# Draw values from a normal distribution with numpy.random, then truncate them to integers
avg = df['births'].div(24).mean()
stdev = df['births'].div(24).std()
hourly['births'] = np.random.normal(loc=avg, scale=stdev, size=(24*365)).astype(int)

# Create datetimes, one per hour for each day in 1959
# ('h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas versions)
hourly.index = pd.date_range('1/1/1959', periods=(24*365), freq='h')

# Select all rows from a single date
hourly.loc['1959-07-04']
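
One detail worth knowing is that the same date string returns differently shaped results depending on the frequency of the index. The sketch below assumes two hypothetical frames, `daily` and `hourly`, filled with seeded random integers so that it runs without the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical daily and hourly frames covering 1959
daily = pd.DataFrame(
    {'births': rng.integers(20, 60, size=365)},
    index=pd.date_range('1959-01-01', periods=365, freq='D'),
)
hourly = pd.DataFrame(
    {'births': rng.integers(0, 5, size=24 * 365)},
    index=pd.date_range('1959-01-01', periods=24 * 365, freq='h'),
)

# On a daily index the string is an exact match: one row, returned as a Series
one_day_daily = daily.loc['1959-07-04']

# On an hourly index the string is coarser than the index: a 24-row DataFrame
one_day_hourly = hourly.loc['1959-07-04']

print(type(one_day_daily).__name__, type(one_day_hourly).__name__)
```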

**5.	Print summary statistics of observations in the dataframe.**

In [None]:
df.describe()

[Describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) gives us descriptive statistics about the distribution of the data, including the number of observations (count), the central tendency (mean, plus the median shown as the 50% row), the range (min and max), and the spread (standard deviation and the 25% and 75% percentiles). These give us an initial overview of the data and can help us see if there might be outliers or anomalies in the observations.
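
To make the rows of the summary concrete, here is a minimal sketch on a hypothetical five-value column where each statistic can be checked by hand. The name `toy` and its values are placeholders, not part of the births data.

```python
import pandas as pd

# Hypothetical column whose statistics are easy to verify by hand
toy = pd.DataFrame({'births': [1, 2, 3, 4, 5]})

summary = toy['births'].describe()
print(summary)

# Individual statistics are available by label
median = summary['50%']  # the 50% row is the median
spread = summary['std']  # sample standard deviation (pandas divides by n - 1)
```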

**Assignments**

To cement your learning, I would suggest forking this kernel and playing with the parameters and data to get practice working with time series data.
Try a few of your own date values for the data slicing. Can you find alternate syntax for slicing the data?
Can you successfully make changes to the code and fix any errors?
How does changing the parameters affect the output?
*Finally, try repeating the steps above with the other datasets included with this kernel (see the code blocks below this one).*

**Extra Credit**

Find some time series datasets of your own to practice with. There are many available for free; you can find them in [Kaggle's datasets](https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=all&filetype=all&license=all&tagids=6618) and [datamarket.com](https://datamarket.com/data/list/?q=provider:tsdl).

*A lot of the learning process for picking up a programming language involves solving unexpected problems to gain a real understanding of the language's syntax and how it functions. The best way to pick up a programming language is by trial and error, and through debugging code.*

**Acknowledgements**

This lesson was adapted from the first lesson in Jason Brownlee's free mini course "Time Series Forecasting With Python Mini-Course" at [MachineLearningMastery.com](https://machinelearningmastery.com/category/time-series/).
I'd also like to give a shout out to Chris Albon, who runs a very good data science blog at [ChrisAlbon.com](https://chrisalbon.com/). There are tonnes of tutorials covering a diverse range of topics on pandas and machine learning, all with excellent explanations and easy-to-follow code snippets.

**Thanks!**

Thank you for taking the time to read my first public kernel! I hope that you enjoyed it, and perhaps learned something too. If this was a bit basic for you, please check out the lessons later in the series (coming soon, links will be added to this kernel), where I explore more advanced topics.
I encourage you all to leave your feedback and criticism in the comments below, particularly if you have any questions or feel that more could be added :).

In [None]:
# Get a list 'filename' of all practice files
folder = '../input/time-series-practice-datasets/'
filename = os.listdir(folder)
filename

In [None]:
# Create a new DataFrame 'df', referencing a file name from the list
df = pd.read_csv(folder+filename[4])
df.head()

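Note that you can change the list index to practice with a different file, i.e. the number inside the brackets of *filename[4]*. Because os.listdir() returns names in arbitrary order, check the printed list above to see which file each index refers to.
There is also an Excel spreadsheet (.xlsx) file in the practice files folder so that you can practice loading data from a different file format.
    e.g. *pd.read_excel(folder+filename[3])*
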

In [None]:
dfxl = pd.read_excel(folder+filename[3])
dfxl.head()
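
Picking files by position in the list returned by `os.listdir()` depends on an ordering that is not guaranteed. One possible way to make the selection repeatable is to sort the names and filter by extension. The sketch below is an illustration only: it builds a temporary folder with hypothetical placeholder files (`a.csv`, `b.csv`, `c.xlsx`) rather than reading the real practice datasets.

```python
import os
import tempfile

# Hypothetical folder with empty placeholder files standing in for the practice datasets
with tempfile.TemporaryDirectory() as folder:
    for name in ['b.csv', 'a.csv', 'c.xlsx']:
        open(os.path.join(folder, name), 'w').close()

    # Sorting gives the same order on every run and every machine
    filenames = sorted(os.listdir(folder))

    # Keep only the CSV files so an index never lands on the spreadsheet
    csv_files = [f for f in filenames if f.endswith('.csv')]

    # os.path.join builds the path without relying on a trailing slash in `folder`
    first_csv_path = os.path.join(folder, csv_files[0])

    print(filenames)
    print(first_csv_path)
```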