In this notebook we will analyse data from a water pump which experienced frequent failures during spring and summer 2018.

As input we have time series data from 52 sensors which measure different physical properties of the system (like temperature and pressure). We will try to extract the different working modes of the pump and highlight possible early warning signals of breakage.

As always, we will start with some (brief) exploratory analysis, with the aim of examining missing or redundant data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Let's now import pandas and pyplot, and load the dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("../input/pump-sensor-data/sensor.csv")

# Data exploration and cleaning

Let's have a first look at the data

In [None]:
data.head()

The time of each reading is recorded in the timestamp column. Since it is stored as a string, we create a new column, datetime, where the time is stored as pd.Timestamp.

We will also drop the 'Unnamed: 0' column, since it is just a row count.

In [None]:
# Convert the timestamp column from string to datetime format
data['datetime'] = pd.to_datetime(data['timestamp'])
data.drop(['timestamp', 'Unnamed: 0'], axis=1, inplace=True)
data.head()
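
To see what the conversion does, here is a minimal sketch on a toy dataframe. The values are made up and the column name sensor_01 is only used for illustration. Once the column is a proper datetime type, we can do arithmetic on it, for example compute the sampling step.

```python
import pandas as pd

# Toy frame standing in for the sensor file (values are made up)
toy = pd.DataFrame({
    "timestamp": ["2018-04-01 00:00:00", "2018-04-01 00:01:00"],
    "sensor_01": [47.1, 47.2],
})

# Parse the strings into datetimes and drop the original string column
toy["datetime"] = pd.to_datetime(toy["timestamp"])
toy = toy.drop(columns=["timestamp"])

# With a datetime dtype we can subtract consecutive rows
step = toy["datetime"].diff().iloc[1]
print(toy.dtypes)
print(step)
```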

Now we can plot the data and quickly analyse some of the patterns

In [None]:
# Extract the readings from the BROKEN state of the pump
broken = data[data['machine_status']=='BROKEN']
recovering = data[data['machine_status']=='RECOVERING']
# Pick a few sensor columns to plot (every other column among the first five)
sensors_to_plot = data.columns[0:5:2]
# Plot time series for each sensor with BROKEN state marked with X in red color
for name in sensors_to_plot:
    plt.figure(figsize=(18,3))
    plt.plot(broken['datetime'], broken[name], linestyle='none', marker='X', color='red', markersize=12, label='broken')
    plt.plot(recovering['datetime'], recovering[name], linestyle='none', marker='X', color='orange', markersize=6, label='recovering')
    plt.plot(data['datetime'], data[name], color='blue', label='working')
    plt.title(name)
    plt.legend()
    plt.show()

A few observations:
- Sensor measurements vary a lot in range.
- The sensors are likely to be referring to very different quantities (some temperature, some pressure, etc.).

We will need to rescale all the sensor data separately, so that values along each column are in the same range (0 to 1).

Let's explicitly count the number of failures of the pump

In [None]:
data['machine_status'].value_counts()

**Question 1** \
In order to detect early signals of pump failure, and given the data distribution, what do you think would be the best approach?

a) Supervised learning, with classification to predict failure events \
b) Supervised learning with regression \
c) Unsupervised learning and anomaly detection \
d) None of the above: we have too little data

For each sensor, we now want to print the fraction of missing data (multiply by 100 to read it as a percentage)

In [None]:
perc_nans = data.isnull().sum().sort_values(ascending=False)/len(data)
perc_nans.head(10)

We will create a new dataframe called df_tidy, dropping all duplicate rows and all sensors with more than 3% missing data.

In [None]:
# Drop duplicate rows, then the sensors with too many missing values
df_tidy = data.drop_duplicates().drop(columns=['sensor_15', 'sensor_50', 'sensor_51', 'sensor_00'])
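
The list of sensors to drop is hard-coded above. As a sketch of how it could be derived from the 3% threshold instead, here is the same idea on a toy dataframe. The column names and values are made up, and `threshold`, `missing_fraction` and `to_drop` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data: sensor_b is missing in half of the rows
toy = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [np.nan, np.nan, 3.0, 4.0],
    "sensor_c": [1.0, 2.0, 3.0, 4.0],
})

threshold = 0.03  # maximum tolerated fraction of missing values

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

toy_tidy = toy.drop(columns=to_drop)
print(to_drop)
```

Deriving the list this way keeps the code in sync with the threshold if the data or the threshold changes.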

# K-means clustering and operating modes

Since we do not have enough data for supervised learning, we will explore the different working regimes of the pump with unsupervised learning. We will be using k-means clustering.

In [None]:
from sklearn.cluster import KMeans

We create X_train, a new dataset which consists only of sensor data (no labels, no timestamps).

Then, we will rescale the sensor data so that they lie in the same range. If we subtract the column minimum and then divide by the maximum of the shifted column (that is, by the column's range), all the values will be between 0 and 1.

In [None]:
X_train = df_tidy.drop(['machine_status', 'datetime'], axis=1)
X_train -= X_train.min()
X_train /= X_train.max()
X_train.head()
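
The rescaling above is min-max scaling. A minimal sketch on made-up numbers (the column names temperature and pressure are only for illustration) shows that each column ends up between 0 and 1 regardless of its original units:

```python
import pandas as pd

# Two toy "sensors" with very different ranges (values are made up)
toy = pd.DataFrame({
    "temperature": [20.0, 25.0, 30.0],
    "pressure": [1.0, 3.0, 2.0],
})

# Min-max scaling: subtract the column minimum, divide by the column range
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Note that a constant column has a range of zero, so this formula would produce NaN for it.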

We now want to fill the missing values for each column. 

**Question 2** \
When filling time series null values, which approach would you use? \
e.g. you have the temperature values in Munich for April, but you are missing 10 minutes in one of the days. What do you use to fill those values?

a) copy the last registered temperature before missing values \
b) average temperature across the day \
c) copy the temperature from previous day at the same time \
d) average temperature in April

To fill missing values in a time series, we can use pandas ffill, which copies the last valid observation forward.

In [None]:
# Forward-fill gaps; ffill() replaces the deprecated fillna(method='ffill')
X_train.ffill(inplace=True)
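
As a small illustration of forward filling, here is a sketch on a made-up temperature series. It also shows one caveat: a gap at the very start of the series has no earlier value to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Made-up readings with a leading gap and a gap in the middle
temps = pd.Series([np.nan, 12.0, np.nan, np.nan, 13.5])

# Each missing value takes the last valid reading before it
filled = temps.ffill()
print(filled)
```

If leading gaps are present, they need separate handling (for example a backward fill) before clustering, since k-means cannot work with missing values.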

**Elbow method**

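The elbow method is a heuristic for deciding how many clusters to use for k-means clustering. We fit the model for a range of values of k and look for the point where adding more clusters stops reducing the inertia substantially.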

In [None]:
inertia = []
 
for k in range(1, 15):
    # Building and fitting the model with k clusters
    # (random_state is fixed so that the curve is reproducible)
    kmeanModel = KMeans(n_clusters=k, random_state=42).fit(X_train)
    inertia.append(kmeanModel.inertia_)

In [None]:
K = range(1, 15)
plt.figure(figsize=(7,5))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
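
The elbow is usually read off the plot by eye. As a rough numerical cross-check, one option is to look for the largest second difference of the inertia curve, i.e. the point where the slope flattens the most. This is only a heuristic sketch and not part of the original analysis; the inertia values below are made up.

```python
import numpy as np

# Made-up inertia values for k = 1..6, with a visible bend at k = 3
ks = np.arange(1, 7)
toy_inertia = np.array([900.0, 500.0, 150.0, 130.0, 115.0, 105.0])

# Second difference: entry i is centred on ks[i + 1]
curvature = np.diff(toy_inertia, n=2)

# The elbow is where the curve bends most sharply
elbow_k = int(ks[np.argmax(curvature) + 1])
print(elbow_k)
```

On real data the curve is often smoother than this, so the result should be checked against the plot rather than trusted blindly.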

**Question 3** \
Using the elbow method, from the previous graph what would you use as the number of clusters for k-means?

a) 2 \
b) 4 \
c) 5 \
d) 6

In [None]:
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_train)
labels = kmeans.predict(X_train)

unique_elements, counts_elements = np.unique(labels, return_counts=True)
clusters = np.asarray((unique_elements, counts_elements))
df_tidy['cluster'] = labels

In [None]:
colors = ['limegreen', 'orange', 'red', 'yellow', 'cyan']
colors_plot = [colors[i] for i in df_tidy['cluster'].values]
sensors_to_plot = ['sensor_01']
for name in sensors_to_plot:
    plt.figure(figsize=(18,3))
    plt.plot(df_tidy['datetime'], df_tidy[name], color='blue', label='sensor data')
    plt.vlines(df_tidy['datetime'], 32, 55, color=colors_plot, alpha=0.5)
    plt.title(name)
    plt.legend()
    plt.show()

# Cluster visualisation

We can use t-SNE to project the data onto 2D and colour each point by its cluster, to get a rough idea of the geometrical relationships between the clusters

In [None]:
from sklearn.manifold import TSNE

subsampling_step = 500
X_subset = X_train.loc[::subsampling_step]
colors_plot = [colors[i] for i in df_tidy['cluster'].loc[::subsampling_step]]
X_embedded = TSNE(perplexity = 30, random_state=42).fit_transform(X_subset)
plt.figure(figsize=(7,5))
plt.scatter(X_embedded[:,0], X_embedded[:,1], color=colors_plot)

# Back to sensors: centroid analysis

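We have found different regimes of the water pump. Which sensors change the most between these regimes? As an example, we compare the centroids of clusters 0 and 4 and rank the sensors by how far apart the two centroids are along each one.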

In [None]:
centroid_distance = np.abs(kmeans.cluster_centers_[0] - kmeans.cluster_centers_[4])
changed_sensors = np.argsort(centroid_distance)
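
To make the ranking step concrete, here is a sketch with two made-up centroids over four hypothetical features. np.argsort sorts in ascending order, so the features that differ most between the two centroids are at the end of the result.

```python
import numpy as np

# Hypothetical feature names and centroids (values are made up)
feature_names = ["sensor_a", "sensor_b", "sensor_c", "sensor_d"]
centroids = np.array([
    [0.10, 0.50, 0.90, 0.30],
    [0.15, 0.10, 0.20, 0.30],
])

# Absolute per-feature difference between the two centroids
diff = np.abs(centroids[0] - centroids[1])

# Indices sorted from smallest to largest difference
order = np.argsort(diff)

# Take the last two indices and reverse them: most changed first
top2 = [feature_names[i] for i in order[-2:][::-1]]
print(top2)
```

Because the data were rescaled to the range 0 to 1 before clustering, these differences are comparable across sensors with different units.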

In [None]:
for sensor_idx in changed_sensors[-3:]:
    sensor_data = df_tidy.iloc[:, sensor_idx]
    plt.figure(figsize=(18,3))
    plt.plot(df_tidy['datetime'], sensor_data, color='blue', label='changed_sensor')
    plt.title(f'Sensor {df_tidy.columns[sensor_idx]}')
    plt.legend()
    plt.show()

The next steps would be to characterise each of these regimes, and then to perform anomaly detection. In the case of this dataset, we do not have enough data to describe the 'normal' working mode of the system, so it is very difficult to label anything as anomalous. Knowing what the sensors refer to, or having more information on why the sensor readings shift so often, would certainly help.

**Question 4** \
What other signal processing techniques would you use for this problem?