# Using the SURFsara IoT platform for Sensemakers - demo #1

This notebook shows how to:
- access files in the shared volume
- load raw data JSON files into a Pandas DataFrame
- perform simple data manipulations
- inspect data
- produce plots
- store results in the project volume for later use

## Accessing files in the shared volume

The automated data pipeline appends each processed message to a file that holds the data for one project, one device and one calendar date. The naming convention for the directories/files is `/data/app_id/dev_id-YYYY-MM-DD.json`. The shared volume is accessible from Jupyter notebooks in read-only mode.
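The naming convention can be turned into a small helper that builds the path for a given project, device and date. This is only a sketch: the function name `raw_file_path` is hypothetical, and the default root `/home/shared` is an assumption taken from the paths used in the cells of this notebook.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_file_path(app_id, dev_id, day, root="/home/shared"):
    """Build the path of the raw data file for one device and one day.

    `root` is an assumption: adjust it to wherever the shared volume
    is mounted in your environment.
    """
    # File name pattern: dev_id-YYYY-MM-DD.json
    file_name = f"{dev_id}-{day.isoformat()}.json"
    return str(PurePosixPath(root) / app_id / file_name)


example_path = raw_file_path("WON", "SMA-A42924", date(2019, 10, 9))
print(example_path)
```

The resulting string can be passed directly to `pd.read_json(..., lines=True)`.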

The files in the shared volume can be listed in the following way:

In [None]:
!ls /home/shared/WON/SMA-A42924*

Each file contains raw messages in JSON format, one message per line.

In [None]:
!head /home/shared/WON/SMA-A42924-2019-10-09.json
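Because every line is a complete JSON document, such a file can also be parsed with the standard library alone. The sketch below uses a made-up, two-line stand-in for a raw file (the values and the device name `demo-device` are invented for illustration; real messages contain more fields).

```python
import io
import json

# Made-up stand-in for a raw data file: one JSON object per line.
sample = io.StringIO(
    '{"dev_id": "demo-device", "payload_fields": {"temp": 21.5, "hum": 40}}\n'
    '{"dev_id": "demo-device", "payload_fields": {"temp": 21.7, "hum": 41}}\n'
)

# Parse each non-empty line as a separate JSON message.
messages = [json.loads(line) for line in sample if line.strip()]

print(len(messages))
print(messages[0]["payload_fields"])
```

With a real file, `io.StringIO(...)` would be replaced by `open(path)`.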

## Load raw data JSON files into a pandas DataFrame

We use [pandas](https://pandas.pydata.org/) DataFrames to analyse the data. First, we need to install the corresponding Python package.

In [None]:
!pip install --upgrade pip
!pip install pandas

A single file from the shared volume can be loaded with one command. The option `lines=True` tells pandas that the file contains one JSON message per line.

In [None]:
import pandas as pd

# Load a single JSON file into a Pandas DataFrame.
df = pd.read_json('/home/shared/WON/SMA-A42924-2019-10-09.json', lines=True)

# Show the dataframe.
df

Multiple files can be loaded like this:

In [None]:
import glob

# List all files for project WON and device SMA-A42924.
files = glob.glob('/home/shared/WON/SMA-A42924*')

# Load every file into its own dataframe (sorted by name, i.e. by date).
frames = [pd.read_json(file, lines=True) for file in sorted(files)]

# Concatenate them into a single dataframe.
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
df0 = pd.concat(frames, ignore_index=True)
    
# Show the dataframe.
df0

The following commands may come in handy for getting basic information about the dataframe.

In [None]:
df0.info()

In [None]:
# Show the first few lines of the dataframe.
df0.head()

In [None]:
# Show the types of the columns.
df0.dtypes

## Simple data manipulations

The dataframe we loaded from the files in the cells above is not yet convenient for data analytics. The most important values, the sensor measurements, are not easily accessible because they are all stored in a single column `payload_fields` as a dictionary of key-value pairs. Therefore, we will reformat the dataframe so that every sensor measurement gets its own column.

In the previous section, we loaded all raw data files into the dataframe `df0`. In the following cell, we will create a new dataframe `df1` with the new columns.

In [None]:
# Extract payload_fields as individual columns.
payload_fields = df0['payload_fields'].apply(pd.Series)

# Add the new columns to the dataframe.
df1 = df0.join(payload_fields)

# Remove the original column.
df1 = df1.drop('payload_fields', axis=1)

# Show the dataframe.
df1
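On larger dataframes, `apply(pd.Series)` tends to be slow because it builds one Series per row. A possible alternative is `pd.json_normalize`, sketched below on a small made-up dataframe (the values and the device name are invented). The sketch assumes the dataframe has a default 0..n-1 index, as is the case after loading with `ignore_index=True`, so that the join aligns the rows correctly.

```python
import pandas as pd

# Made-up stand-in for df0 with a dictionary column.
demo = pd.DataFrame({
    "dev_id": ["demo-device", "demo-device"],
    "payload_fields": [{"temp": 21.5, "hum": 40}, {"temp": 21.7, "hum": 41}],
})

# Turn the list of dictionaries into a dataframe with one column per key.
fields = pd.json_normalize(demo["payload_fields"].tolist())

# Replace the dictionary column by the new columns.
flat = demo.drop(columns="payload_fields").join(fields)

print(flat.columns.tolist())
```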

In [None]:
# Show the column types.
df1.dtypes

Convert the Unix epoch time (given in milliseconds) to a human-readable datetime format.

In [None]:
df1["datetime"] = pd.to_datetime(df1["time"], unit='ms')
df1["datetime"]
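The effect of `unit='ms'` can be checked on a known value: 1570579200000 milliseconds after the epoch is midnight UTC on 9 October 2019. The resulting timestamps carry no time zone information; the second half of this sketch marks them as UTC and converts them to local time. The zone `Europe/Amsterdam` is an assumption made for illustration only.

```python
import pandas as pd

# Two epoch timestamps in milliseconds, one minute apart.
epoch_ms = pd.Series([1570579200000, 1570579260000])

# Interpret the integers as milliseconds since 1970-01-01 00:00:00 UTC.
converted = pd.to_datetime(epoch_ms, unit="ms")
print(converted.iloc[0])

# Optional: attach the UTC zone and convert to a local time zone
# (Europe/Amsterdam is an assumed example zone).
local = converted.dt.tz_localize("UTC").dt.tz_convert("Europe/Amsterdam")
print(local.iloc[0])
```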

## Data inspection

In the previous section, we created the dataframe `df1`, which is convenient for analysis and inspection.

Show basic statistics for the data in numeric columns.

In [None]:
df1.describe()

Show the messages with `fail` = 1.

In [None]:
df1[df1["fail"] == 1]

Remove the messages with `fail` = 1.

In [None]:
df2 = df1[df1["fail"] != 1]

## Data visualisations

We use [matplotlib](https://matplotlib.org/) to visualise the data. First, we need to install the corresponding Python package.

In [None]:
!pip install matplotlib

Make sure the plots will be produced directly in the notebook.

In [None]:
%matplotlib inline

Plot histograms.

In [None]:
df2["hum"].hist()

In [None]:
df2["temp"].hist()

The temperature histogram indicates that there are outliers. They become obvious in a histogram with a logarithmic y-axis.

In [None]:
import matplotlib.pyplot as plt

# Drop missing values first, then plot with a logarithmic y-axis.
plt.hist(df2['temp'].dropna(), log=True)

Box plots are also useful.

In [None]:
df2[["hum", "temp"]].plot.box()

Remove the outliers by keeping only the messages with `temp` below 100.

In [None]:
df3 = df2[df2["temp"] < 100]
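A fixed threshold works when the plausible range of the sensor is known. When it is not, one common alternative is the interquartile range (IQR) rule, which is also what a box plot uses for its whiskers. The sketch below applies it to made-up temperature values; the factor 1.5 is the conventional choice, not something prescribed by this notebook.

```python
import pandas as pd

# Made-up temperature readings with one obvious outlier.
demo = pd.DataFrame({"temp": [20.1, 20.5, 21.0, 21.3, 22.0, 3276.7]})

# Compute the quartiles and the interquartile range.
q1 = demo["temp"].quantile(0.25)
q3 = demo["temp"].quantile(0.75)
iqr = q3 - q1

# Keep only values within 1.5 * IQR of the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
cleaned = demo[demo["temp"].between(lower, upper)]

print(len(demo) - len(cleaned), "row(s) removed")
```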

In [None]:
df3.plot(x="datetime", y="temp")

## Store results in the project volume

Save the plot as a PDF file. The file can be downloaded from the file browser on the left.

In [None]:
plot = df3.plot(x="datetime", y="hum")
plot.get_figure().savefig('hum.pdf', format='pdf')
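The cleaned data can be stored in a similar way, so that a later notebook does not have to repeat the loading and cleaning steps. The following sketch writes a small made-up dataframe to a CSV file and reads it back. It writes to a temporary directory and uses the file name `cleaned.csv`; both are assumptions for illustration, and in practice the target would be a path in the project volume.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned dataframe.
demo = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-10-09 00:00:00", "2019-10-09 00:01:00"]),
    "temp": [21.5, 21.7],
})

# Write the dataframe to a CSV file without the index column.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")
demo.to_csv(out_path, index=False)

# Read it back, parsing the datetime column again.
restored = pd.read_csv(out_path, parse_dates=["datetime"])
print(restored)
```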