In [1]:
# Author: Volker Hoffmann, SINTEF <volker.hoffmann@sintef.no> <volker@cheleb.net>
# Update: 12 December 2018

# Introduction

### Scope

In this notebook, we demonstrate how to

1. Load some data from a sensor, and
2. Get some descriptive, visual, and predictive analytics

As an example, we use an air temperature sensor that was placed in our offices for a while.

One fine day, something happened to the sensor. *Can you find out what?*

### What's This? How does it work?

What you are working with right now is called a Jupyter (formerly IPython) notebook. It's a browser-based frontend to a Python process that is running on some machine (which can be remote or local). If you want to learn a bit more about this tool, have a look at https://jupyter.org/.

The most important feature of the notebook is the use of cells that can be executed independently. In a cell, you can either

- write code (which you can execute by hitting `SHIFT+ENTER`), or
- write markdown to tell a story.

If you have used Mathematica before, this should feel very familiar.

### Let's Start

Before we start our data exploration, we need to get some generic Python preamble out of the way.

The following cells load different libraries (called modules in Python) and set up our environment.

A big part of working in Python is knowing the right libraries to get stuff done. In this notebook, we use

- `Pandas` to handle our data
- `Matplotlib` and `Seaborn` to make figures
- `Scikit-learn` to compute clusters

Other helpers are

- `Glob` to get a list of files according to some pattern 
- `Numpy` to do some computations
- `Warnings` to modify the verbosity of warnings

In [2]:
%matplotlib inline

In [3]:
import matplotlib as mpl; mpl.rcParams['figure.dpi'] = 96
import matplotlib.pyplot as plt
import glob
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.cluster

In [4]:
import warnings
warnings.filterwarnings('ignore')

# Load Data

The next few cells demonstrate how to load files into Pandas.

When we're done, we will have the data loaded into a structure called a Dataframe.

You can think of a Dataframe as an in-memory Excel table that you can use (to do machine learning, etc.) and modify (calculate new columns, averages, etc.) using Python.

The core function to load data from a CSV file is `pd.read_csv()`.

The dataframe will be accessible as the variable `df`.

The final two cells are housekeeping tasks for convenience. We rename the `value` column (the name `value` was picked up from the CSV file header) and make sure the `timestamp` column has the correct Pandas datatype (which allows advanced time-based operations).

In [5]:
# NB: Do this if the data is somewhere else
# datadir = '/home/volkerh/Notebooks/biglearn_course'
# g = glob.glob("%s/*.csv" % datadir)

In [6]:
g = glob.glob("*DigiFab*.csv")

In [7]:
print(g)

['TempAir-DigiFab_20006414_2018-10-01 00_00_00_2018-11-30 23_59_59.csv']


In [8]:
df = pd.read_csv(g[0])

In [9]:
df.rename(columns={'value': 'air_temperature'}, inplace=True)

In [10]:
df['timestamp'] = pd.to_datetime(df.timestamp)

# Inspect Data / Descriptive Analytics

Before we do anything, let us have a look at what we have in our dataframe.

- The command `df.head(5)` shows the first five rows of the dataframe.
- The command `df.info()` shows general information (how many rows of data, what columns, what datatypes, etc)
- The command `df.describe()` computes and shows basic descriptive statistics (percentiles, averages, and counts) for each column
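
As a rough sketch of what these three commands return, the cell below applies them to a small synthetic dataframe. The dataframe `demo` and its values are made up for illustration and are not the real sensor data; only the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor data (NOT the real measurements).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "timestamp": pd.date_range("2018-10-01", periods=100, freq="10min"),
    "air_temperature": 21.0 + rng.normal(0.0, 0.5, size=100),
})

first_rows = demo.head(5)   # the first five rows
demo.info()                 # prints row count, column names, and datatypes
summary = demo.describe()   # counts, averages, and percentiles per column

print(first_rows)
print(summary)
```

On the real data, the same calls would simply be made on `df` instead of `demo`.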

In [11]:
# maybe some code to look at the first 10 rows

In [12]:
# some code to get a general overview of the dataframe

In [13]:
# some code to compute some descriptive statistics

# Visual Analytics

### Time Series

Let's make some figures so we can actually see what the data looks like.

Try making a figure that plots the variables in the dataframe vs. time.

Some hints:

- The simplest way to plot is to use `plt.plot(x_vector, y_vector)` (pretty much like Matlab)
- To extract a particular column from your dataframe (which we have called `df`), do `df.NAME_OF_COLUMN`
- To add axis labels, you can use `plt.xlabel('SOME_TEXT')` or `plt.ylabel('SOME_TEXT')`
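
The hints above can be combined roughly as follows. This is a minimal sketch on synthetic data; the dataframe `demo` and the axis label texts are assumptions made for illustration, not the real sensor readings.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor data (NOT the real measurements).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "timestamp": pd.date_range("2018-10-01", periods=288, freq="10min"),
    "air_temperature": 21.0 + rng.normal(0.0, 0.3, size=288),
})

plt.figure()
# x: time, y: measured value, each extracted as a column of the dataframe
lines = plt.plot(demo.timestamp, demo.air_temperature)
plt.xlabel("Time")
plt.ylabel("Air temperature")
```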

In [14]:
# plot the time series

### Histograms & Kernel Density Estimates [1D]

Now, you may have found out that something interesting has happened to our sensor around **15 October** and **02 November**. Let's get back to that later.

For now, a nice way of summarizing what a time series of measurements does is to plot the distribution of measurements. In other words, we ignore the time dependence and just have a look what kind of measurement values we observe.

To easily plot some statistical summaries, we can use the `Seaborn` package. You may want to look at the `sns.distplot(vector_of_numbers)` function, which gives you a histogram as well as a kernel density estimate.

In [15]:
# have a look at the statistical distribution of both variables in the dataframe

### Histograms [2D]

Now have a look at how the two sensor measurements relate.

The function `sns.jointplot(x=x_vector, y=y_vector)` will generate a plot of the joint and marginal distributions (as a scatterplot and histograms, respectively).

In [16]:
# plot the joint and marginal distributions

If you stare at the result a bit, you should be able to come up with some hypothesis of what happened.

# Clustering

Humans have an outstanding visual processing capacity in their brains. Unfortunately, in the real world, we cannot have humans look at data streams from hundreds of sensors.

So we would like to automate the procedure of finding clusters of different operating conditions for our sensors.

This is called clustering.

We now try a clustering algorithm that will attempt to find the two major clusters we have so easily found in the image above.

To do clustering, we will be using the `Scikit-Learn` library. 

In `Scikit-Learn`, the programming interface for clustering algorithms (or most other algorithms, for that matter) is the same, i.e.

1. You instantiate the right class (e.g., `algo = sklearn.cluster.SOME_ALGO()`)
2. Fit the data (`algo.fit(matrix_with_data)`)
3. Recover the labels via the `algo.labels_` attribute

If you want to use, for example, DBSCAN, have a look at https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py.

You can also look at the overview of clustering algorithms here: https://scikit-learn.org/stable/modules/clustering.html
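
To illustrate the three-step interface described above, here is a minimal sketch using DBSCAN on synthetic data. The two-column matrix, the parameter values (`eps=0.3`, `min_samples=10`), and the scaling step with `StandardScaler` are all assumptions chosen for this made-up example. They will most likely need adjusting for the real sensor data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic matrix: 400 samples, 2 features, two made-up operating regimes.
rng = np.random.default_rng(4)
matrix_with_data = np.column_stack([
    np.concatenate([rng.normal(21.0, 0.5, 200), rng.normal(5.0, 0.5, 200)]),
    np.concatenate([rng.normal(35.0, 2.0, 200), rng.normal(80.0, 2.0, 200)]),
])

# DBSCAN works on distances, so we bring both features to a common scale.
scaled = StandardScaler().fit_transform(matrix_with_data)

algo = DBSCAN(eps=0.3, min_samples=10)  # 1. instantiate the class
algo.fit(scaled)                        # 2. fit the data
labels = algo.labels_                   # 3. recover the labels (-1 means noise)

n_clusters = len(set(labels) - {-1})
print("clusters found:", n_clusters)
```

Each entry in `labels` is the cluster number assigned to the corresponding row of the input matrix.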

Now, let's try to calculate and plot the clusters.

In [17]:
# some code to compute clusters

To plot the clusters, we can use the basic scatterplot function `plt.scatter(x_values, y_values)`. There's also an argument `c` you can use to pass the color for each sample.

See also https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html.

If you can manage to use the labels (the number of the cluster) calculated by the clustering algorithm to make a scatterplot of the clusters, you should be able to do some interpretation.
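
A minimal sketch of such a coloured scatterplot follows. The data is synthetic, and the `labels` array is a hand-made stand-in for the `algo.labels_` attribute of a fitted clustering algorithm. The axis labels and the colormap are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic points from two made-up regimes (NOT the real measurements).
rng = np.random.default_rng(5)
x_values = np.concatenate([rng.normal(21.0, 0.5, 200), rng.normal(5.0, 0.5, 200)])
y_values = np.concatenate([rng.normal(35.0, 2.0, 200), rng.normal(80.0, 2.0, 200)])

# Stand-in for algo.labels_: one integer cluster label per sample.
labels = np.repeat([0, 1], 200)

plt.figure()
# Passing the labels as `c` colours each point by its cluster number
points = plt.scatter(x_values, y_values, c=labels, cmap="viridis", s=10)
plt.xlabel("First measurement")
plt.ylabel("Second measurement")
plt.colorbar(points, label="Cluster label")
```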

In [18]:
# some code to plot the clusters

Ok, this is (probably) the same figure as above, except that we (you) have coloured each point according to the cluster the **ALGORITHM THINKS** the point is in.

Depending on the clustering algorithm you selected (and its parameters), this may or may not have worked very well.

This is, unfortunately, the reality of applying machine learning in practice. During the first experiments, things usually don't go very smoothly, and it takes experience (and persistence) to

1. Select the most suitable algorithm.
2. Adjust it properly.

This needs to be done with domain experts who know very well the proper operating regimes of their machines.

# Automated Anomaly Detection [Optional]

We should now have found out that something happened with our sensor along the way.

Ideally, we want to detect this automatically. Since the typical measurement range had a sharp transition, we can probably devise a simple (or not so simple) algorithm to detect and flag such jumps.

For example, you could base this on deviations from an acceptable value range. The range could be based on moving averages.

Of course, you can also use many other techniques to do outlier detection. There's a nice introduction here: https://scikit-learn.org/stable/modules/outlier_detection.html

However, if you want to start with simple moving averages, note the following Pandas tricks:

- You can calculate a rolling mean (over the past 128 samples) via `df[SOME_COLUMN_NAME].rolling(128).mean()`.
- You can calculate a rolling standard deviation via `df[SOME_COLUMN_NAME].rolling(128).std()`.
- There are plenty more aggregation functions you can use, of course. See, for example, https://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions
- To return a subset of a dataframe `df`, you can do `df[(df.COLUMN>5) & (df.COLUMN<10)]`. See also https://pandas.pydata.org/pandas-docs/stable/indexing.html
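
Putting these tricks together, one possible sketch of a simple jump detector is shown below. It is only an illustration under stated assumptions: the series is synthetic (a made-up jump at sample 400), and the threshold of 4 standard deviations as well as the column names `baseline`, `spread`, and `is_anomaly` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic series with a sharp jump at sample 400 (NOT the real measurements).
rng = np.random.default_rng(6)
values = np.concatenate([rng.normal(21.0, 0.2, 400), rng.normal(5.0, 0.2, 200)])
demo = pd.DataFrame({"air_temperature": values})

window = 128      # number of past samples in the moving window
threshold = 4.0   # allowed deviation, in rolling standard deviations

rolling = demo["air_temperature"].rolling(window)
# shift(1) so that each sample is compared against the samples BEFORE it
demo["baseline"] = rolling.mean().shift(1)
demo["spread"] = rolling.std().shift(1)

deviation = (demo["air_temperature"] - demo["baseline"]).abs()
demo["is_anomaly"] = deviation > threshold * demo["spread"]

# Boolean indexing returns only the flagged rows
anomalies = demo[demo["is_anomaly"]]
print(anomalies.head())
```

Note that once the jump has entered the window, the rolling standard deviation grows and the detector stops flagging, so this sketch marks the transition rather than the whole period after it.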

# Forecasting [Optional]

Of course, you may also want to forecast the measurements.

A fairly straightforward implementation is available in the Facebook library `Prophet`, cf. https://facebook.github.io/prophet/

But **do note that**, in general, you should bring along some sort of expert (or educate yourself to a sufficient level) so that you know exactly what the model does. Otherwise you may not be able to spot whether it functions as designed or is trying to predict nonsense.