In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('../input/environmental-sensor-data-132k/iot_telemetry_data.csv', engine='python')

In [None]:
data

In [None]:
# !pip install -q pandas-profiling[notebook]
# from pandas_profiling import ProfileReport
# profile = ProfileReport(data, title='Pandas Profiling Report')
# profile.to_notebook_iframe()

# Tasks 

- The stated task is "Use ML to Determine when a Person is near IoT Device"

- As we don't have labels marking the time stamps at which people are near the IoT device, we are limited to unsupervised methods

- We do, however, have three different devices in different locations, each with about a week of time series data recorded at a sampling interval of 5-10 seconds between measurements

- The proximity of a person could affect the recorded parameters in various ways:

-- Would CO levels be impacted by the presence of an individual? Perhaps: if they are breathing air near the sensor, the CO binds very tightly to their haemoglobin (this is how carbon monoxide poisons you) and is released only slowly, effectively scrubbing some of it from the atmosphere like a filter, which could in principle lead to a small reduction in CO ppm

-- Humidity might increase if a person is exhaling into the room

-- Light levels might drop if a person is occluding the light sensor. Light levels might increase if a person turns on a light or opens a closet door

-- LPG levels might drop if a person breathes it in and thereby filters it from the local atmosphere

-- The movement of an individual near to a sensor would create detectable motion, though some information about other sources of motion/vibration would be needed to attribute it to a nearby person

-- Again, smoke levels could be reduced by the filtering effect of a person's lungs, or increased if a person lights up a cigarette in front of the smoke sensor

-- The ambient temperature might be increased by the presence of one or more people next to the temperature sensor for a period of time

# Approach

- Let's plot the time series with a meaningful time scale (time of day might indicate when people are more likely to be nearby)

- Let's find out if there are any daily patterns in the data

- Let's see if there are any significant differences in the sensor time series across the three locations

- Attempt to define thresholds/confidence intervals/clustering on windows of time series data to define plausible "humans are nearby" intervals (i.e. corresponding to light/motion/atmospheric/temperature changes)

# Exploring the time series

In [None]:
# convert the boolean columns to int32 for plotting
data['light_int'] = data['light'].astype('int32')
data['motion_int'] = data['motion'].astype('int32')

In [None]:
# convert unix time (seconds since 1970-01-01 UTC) to a datetime and a formatted string
data['datetime'] = pd.to_datetime(data['ts'], unit='s')
data['string_time'] = data['datetime'].dt.strftime('%Y-%m-%d %H:%M:%S')

In [None]:
# separate out the data for the different devices with a groupby
data_device_gb = data.groupby('device')

In [None]:
for i in data_device_gb:
    print(i[0])

In [None]:
!pip install -q plotly

In [None]:
# plot our time series again with a more meaningful time axis; double click a legend entry to isolate an individual sensor time series

import plotly.express as px

unwanted_cols = {'motion', 'ts', 'device', 'light', 'datetime', 'string_time'}
plot_cols = [c for c in data.columns if c not in unwanted_cols]

for device, device_df in data_device_gb:
    # one figure per device, with a log y axis so the small air quality values stay visible
    fig = px.line(log_y=True, title=device)
    for col in plot_cols:
        # name each trace directly so the legend always matches the column plotted
        fig.add_scatter(x=device_df['string_time'], y=device_df[col], mode='lines', name=col)
    fig.show()

## From the plot above we can see a few things:
-  CO, LPG and smoke levels (air quality metrics) are correlated for each device and vary over the time series and between devices
- Dramatic swings in temperature are recorded (are they real or the result of sensor malfunction?) as well as more moderate temperature oscillations 
- There are spikes in humidity and motion
- Illumination is either continuous or transient

In [None]:
# a list (rather than a set) keeps the subplot order the same on every run
subset = ['smoke', 'humidity', 'temp']
f, axes = plt.subplots(1, 3, figsize=(30, 10))

for i, j in enumerate(subset):
    sns.boxplot(y=j, x='device', data=data, hue='device', orient='v', ax=axes[i])

## Any differences between the three sensor devices in different locations?

00:0f:00:70:91:0a = 00

1c:bf:ce:15:ec:4d = 1c

b8:27:eb:bf:9d:51 = b8

- There are clusters of motion spikes interspersed with motionless intervals. b8 and 1c show far more motion spikes than 00
- 1c shows continuous illumination but the others have light and dark intervals 
- 00 has worse spikes in air quality than the other devices

- From the boxplots above we can see that the three devices are definitely located in distinct environments: 
    
    1. Ambient air pollution levels are highest in b8, followed by 1c and 00. 00 is less polluted most of the time but has more significant spikes of air pollution than the other devices
    2. The three devices are in locations with slightly different average temperatures, in the range of 20-30 degC. 00 and 1c show significant temperature drop outliers
    3. The three devices are also in locations with different humidity levels, in the range of roughly 50-75%. All three have some outliers showing increases and decreases in humidity, which for 1c and 00 are substantial (from 65% down to 0%)

- We can do MANOVA (Multivariate ANalysis Of VAriance) to put some numbers on the differences in means for these variables across the three devices. In fact the statsmodels library has a class for it:

statsmodels.multivariate.manova.MANOVA

But the differences are fairly clear from the boxplots, so let's move on
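For reference, here is a minimal sketch of what such a MANOVA call might look like. It runs on a small synthetic table rather than the real telemetry, and the per-device means and spreads below are made-up illustration values, not figures from the dataset. On the real data you would pass `data` instead of `demo`.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in for the telemetry data (all numbers here are invented for illustration)
rng = np.random.default_rng(0)
assumed_means = {
    "00": (0.020, 75.0, 19.0),
    "1c": (0.018, 50.0, 26.0),
    "b8": (0.022, 51.0, 22.0),
}
frames = []
for device, (smoke_mu, humidity_mu, temp_mu) in assumed_means.items():
    frames.append(pd.DataFrame({
        "device": device,
        "smoke": rng.normal(smoke_mu, 0.002, 200),
        "humidity": rng.normal(humidity_mu, 2.0, 200),
        "temp": rng.normal(temp_mu, 1.0, 200),
    }))
demo = pd.concat(frames, ignore_index=True)

# Dependent variables on the left of "~", grouping factor on the right
manova = MANOVA.from_formula("smoke + humidity + temp ~ device", data=demo)
result = manova.mv_test()
print(result)
```

The printed summary lists the usual multivariate test statistics (Wilks' lambda, Pillai's trace and so on) for the `device` term. Note that MANOVA assumes independent observations, which consecutive time series samples are not, so the p-values on the real data should be read with caution.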

## Any daily patterns in the data?

- We can use Facebook Prophet to easily (if not rapidly) calculate and plot the hourly trends for our data. Let's take smoke across the three devices as an example

In [None]:
!pip install -q fbprophet

In [None]:
data['ds'] = data['datetime']
data['y'] = data['smoke']
data_device_gb = data.groupby('device')

In [None]:
# create a dictionary of dataframes from the groupby
df_dict = {}
for i, j in enumerate(data_device_gb):
    df_dict[i] = j[1]

In [None]:
df_dict[0][['ds','y']]

In [None]:
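# Be advised - the code below, fitting the prophet model, takes a very long time to run
# (from version 1.0 onwards the package is published as `prophet`, i.e. `from prophet import Prophet`)
from fbprophet import Prophet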

In [None]:
prophet_dict = {}
for i in df_dict:
    # a Prophet object can only be fit once, so create a fresh model for each device
    prophet_dict[i] = Prophet().fit(df_dict[i][['ds','y']])

In [None]:
future_dict = {}
for i in prophet_dict:
    m = prophet_dict[i]
    future_dict[i] = m.make_future_dataframe(periods=0, freq='H')
# future = m.make_future_dataframe(periods=0, freq='H')

In [None]:
fcst_dict = {}
for i in future_dict:
    m = prophet_dict[i]
    fcst_dict[i] = m.predict(future_dict[i])
# fcst = m.predict(future)

In [None]:
for i in fcst_dict:
    m = prophet_dict[i]
    fig = m.plot_components(fcst_dict[i])
    ax = fig.gca()
    ax.set_title("Smoke - Device {}".format(i+1), size=16, loc = 'right')
#fig = m.plot_components(fcst)

- From the fbprophet model above we can see that smoke levels tend to fall from around 6 am to 8 pm and rise around midnight each day, with the same trend seen across the three device locations. The times are in UTC, as they are derived from Unix timestamps

- You could go through each sensor type and generate trend data as above. You could also groupby on the datetime column to split the data into days, then calculate the mean and confidence intervals across days for each time of day, giving you similar trend information to that generated by fbprophet

- These confidence intervals could be the basis for an anomaly detection system (smoke alarm, human alarm etc)
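The groupby alternative could be sketched roughly as follows. This is only an illustration on synthetic data: the helper names `hourly_band` and `flag_anomalies`, the normal approximation, and the `ci_z`/`band_z` values are assumptions rather than anything established in this notebook. The confidence interval describes the hourly mean, while the wider band (mean plus or minus a few standard deviations) is what you would compare single readings against.

```python
import numpy as np
import pandas as pd

def hourly_band(df, value_col, time_col="datetime", ci_z=1.96, band_z=3.0):
    """Per-hour-of-day mean, a CI for that mean, and a wider band for single readings."""
    stats = df.groupby(df[time_col].dt.hour)[value_col].agg(["mean", "std", "count"])
    stats.index.name = "hour"
    sem = stats["std"] / np.sqrt(stats["count"])  # standard error of the hourly mean
    stats["ci_low"] = stats["mean"] - ci_z * sem
    stats["ci_high"] = stats["mean"] + ci_z * sem
    stats["band_low"] = stats["mean"] - band_z * stats["std"]
    stats["band_high"] = stats["mean"] + band_z * stats["std"]
    return stats

def flag_anomalies(df, stats, value_col, time_col="datetime"):
    """True where a reading falls outside the band for its hour of day."""
    hours = df[time_col].dt.hour
    low = hours.map(stats["band_low"])
    high = hours.map(stats["band_high"])
    return (df[value_col] < low) | (df[value_col] > high)

# Synthetic smoke-like series: three days at one reading per minute with a daily cycle
rng = np.random.default_rng(0)
times = pd.date_range("2020-07-12", periods=3 * 24 * 60, freq="min")
hour_of_day = times.hour.to_numpy()
smoke = 0.02 + 0.002 * np.cos(2 * np.pi * hour_of_day / 24) + rng.normal(0, 0.0001, len(times))
demo = pd.DataFrame({"datetime": times, "smoke": smoke})
demo.loc[1000, "smoke"] += 0.01  # inject one artificial spike

stats = hourly_band(demo, "smoke")
flags = flag_anomalies(demo, stats, "smoke")
print(stats[["mean", "ci_low", "ci_high"]].head())
print("flagged readings:", int(flags.sum()))
```

On the real data you would run this per device (for example on each frame in `df_dict`), since the devices sit in clearly different environments.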

In [None]:
# make the datetime column the index and then use index.day to groupby day

# DFList = []
# for group in df.groupby(df.index.day):
#     DFList.append(group[1])

# Unsupervised learning to identify time series windows where humans are near

### The central question here is which aspects of the time series can be attributed to human activity. It could be defined by:
- motion spikes - when humans are near they wobble the accelerometer

- light spikes - when humans open a door light falls on the detector 

- spikes in air pollution - when humans drive up to a sensor or turn on a machine fumes are produced

- temperature and humidity spikes - when humans open a door the temperature and humidity shift accordingly
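To make these signals comparable, one option is to summarise each device's readings into fixed time windows. The sketch below assumes the `datetime`, `motion`, `light` and `temp` columns used earlier; the helper name `window_features`, the one-minute window and the particular features are illustrative choices, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def window_features(df, window="1min"):
    """Summarise one device's readings into fixed windows (window length is an assumption)."""
    indexed = df.set_index("datetime").sort_index()
    motion = indexed["motion"].astype(int)
    # 1 wherever the light flag differs from the previous reading, else 0
    light_toggles = indexed["light"].astype(int).diff().abs().fillna(0)
    temp = indexed["temp"].resample(window)
    return pd.DataFrame({
        "motion_count": motion.resample(window).sum(),
        "light_changes": light_toggles.resample(window).sum(),
        "temp_range": temp.max() - temp.min(),
    })

# Synthetic ten minutes of readings at a 5 second interval
n = 120
times = pd.Timestamp("2020-07-12") + pd.to_timedelta(np.arange(n) * 5, unit="s")
motion = np.zeros(n, dtype=bool)
motion[36:40] = True   # four motion readings starting at 00:03:00
light = np.zeros(n, dtype=bool)
light[36:60] = True    # light on from 00:03:00, off again at 00:05:00
temp = np.full(n, 22.0)
temp[40:48] += 0.5     # brief warm spell within the 00:03 window
demo = pd.DataFrame({"datetime": times, "motion": motion, "light": light, "temp": temp})

features = window_features(demo)
print(features)
```

Each row of `features` then describes one window, which is a convenient unit for the thresholding and clustering ideas discussed in this section.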

### Really we need more information about the system and the impact of human behaviour within it in order to build a system to flag human activity. Is human proximity an anomaly or a regular occurrence? In what way do humans interact with the environment and how would this affect the sensor data?

### Nevertheless, from what we have here there are a few approaches to defining time series windows containing human proximity:

- Setting thresholds (if variable_x > cut_off_y: human_presence = True else: human_presence = False). If motion = human activity, we already have all we need for this

- Doing some unsupervised clustering to bin time series windows into "classes" or "clusters" which may be subsequently labelled as containing traces of "human proximity". 

- Defining confidence intervals in the time series to create anomaly detection thresholds (if human activity is the anomaly)
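As a rough sketch of the first two approaches, the block below applies a fixed threshold and a two-cluster k-means to a table of per-window features. Everything in it is an assumption made for illustration: the feature names, the synthetic "quiet" and "active" windows, the choice of k-means with two clusters, and the rule that the cluster with more motion is the "human nearby" one.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic per-window features: 80 quiet windows and 20 active ones (made-up values)
rng = np.random.default_rng(0)
quiet = pd.DataFrame({
    "motion_count": np.zeros(80),
    "light_changes": np.zeros(80),
    "temp_range": rng.normal(0.05, 0.01, 80),
})
active = pd.DataFrame({
    "motion_count": rng.integers(1, 5, 20).astype(float),
    "light_changes": rng.integers(0, 2, 20).astype(float),
    "temp_range": rng.normal(0.5, 0.1, 20),
})
windows = pd.concat([quiet, active], ignore_index=True)
feature_cols = ["motion_count", "light_changes", "temp_range"]

# Approach 1: a fixed threshold on a single variable
windows["presence_threshold"] = windows["motion_count"] > 0

# Approach 2: unsupervised clustering on standardised features
scaled = StandardScaler().fit_transform(windows[feature_cols])
windows["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Cluster ids are arbitrary, so label the cluster with more motion as "human nearby"
active_cluster = windows.groupby("cluster")["motion_count"].mean().idxmax()
windows["presence_cluster"] = windows["cluster"] == active_cluster

agreement = (windows["presence_threshold"] == windows["presence_cluster"]).mean()
print("share of windows where the two approaches agree:", agreement)
```

Without labels there is no way to say which approach is right; comparing where they agree and disagree is mainly a way to find windows worth inspecting by eye.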

TBC

