# Sensor Time-Series Analysis

* This notebook is an introduction to working with data collected from sensors embedded in IoT devices.
> "Time series is a series of data points indexed (or listed or graphed) in time order."

### A. Importing the required Libraries

In [None]:
## NumPy is the core Python package for scientific computing. Its ndarray (NumPy array) is a multidimensional array that stores values of a single data type.
import numpy as np
## Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
import pandas as pd
## Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
import matplotlib
import matplotlib.pyplot as plt
## Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
import seaborn as sns

In [None]:
## `%matplotlib` is a magic function in IPython. With the `inline` backend, the output of plotting commands is displayed within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots are also stored in the notebook document.
%matplotlib inline

> **NOTE:** If you need to know more about Python programming language, this free book is highly recommended:
* PDF: [A Whirlwind Tour of Python](https://www.oreilly.com/programming/free/files/a-whirlwind-tour-of-python.pdf)  
* Code: https://github.com/jakevdp/WhirlwindTourOfPython

### B. Importing the weather and energy dataset
The dataset contains smart-meter readings of house appliances in kW, taken at 1-minute intervals, together with the weather conditions of that particular region.

#### Data Columns Descriptions:
(Data source: https://www.kaggle.com/taranvee/smart-home-dataset-with-weather-information)
##### Index 
- **time**
    * Time of the reading; consecutive readings are 1 minute apart.

##### Energy Usage 
- **use [kW]**
    * Total energy consumption
- **gen [kW]**
    * Total energy generated by means of solar or other power generation resources
- **House overall [kW]**
    * overall house energy consumption
- **Dishwasher [kW]** 
    * energy consumed by specific appliance
- **Furnace 1 [kW]**
    * energy consumed by specific appliance
- **Furnace 2 [kW]**
    * energy consumed by specific appliance
- **Home office [kW]**
    * energy consumed by specific appliance
- **Fridge [kW]**
    * energy consumed by specific appliance
- **Wine cellar [kW]**
    * energy consumed by specific appliance
- **Garage door [kW]**
    * energy consumed by specific appliance
- **Kitchen 12 [kW]**
    * energy consumption in kitchen 1
- **Kitchen 14 [kW]**
    * energy consumption in kitchen 2
- **Kitchen 38 [kW]**
    * energy consumption in kitchen 3
- **Barn [kW]**
    * energy consumed by specific appliance
- **Well [kW]**
    * energy consumed by specific appliance
- **Microwave [kW]**
    * energy consumed by specific appliance
- **Living room [kW]**
    * energy consumption in Living room
- **Solar [kW]**
    * Solar power generation

##### Weather
- **temperature**:
    * Temperature is a physical quantity expressing hot and cold.
- **humidity**:
    * Humidity is the concentration of water vapour present in air.
- **visibility**:
    * Visibility sensors measure the meteorological optical range which is defined as the length of atmosphere over which a beam of light travels before its luminous flux is reduced to 5% of its original value.

- **apparentTemperature**:
    * Apparent temperature is the temperature equivalent perceived by humans, caused by the combined effects of air temperature, relative humidity and wind speed. The measure is most commonly applied to the perceived outdoor temperature.
- **pressure**: 
    * Atmospheric pressure. Falling air pressure indicates that bad weather is coming, while rising air pressure indicates good weather.
- **windSpeed**:
    * Wind speed, or wind flow speed, is a fundamental atmospheric quantity caused by air moving from high to low pressure, usually due to changes in temperature.
- **cloudCover**:
    * Cloud cover (also known as cloudiness, cloudage, or cloud amount) refers to the fraction of the sky obscured by clouds when observed from a particular location. Okta is the usual unit of measurement of the cloud cover.
- **windBearing**:
    * In meteorology, an azimuth of 000° is used only when no wind is blowing, while 360° means the wind is from the North. True Wind Direction True North is represented on a globe as the North Pole. All directions relative to True North may be called "true bearings."
- **dewPoint**:
    * the atmospheric temperature (varying according to pressure and humidity) below which water droplets begin to condense and dew can form.
- **precipProbability**:
    * A probability of precipitation (POP), also referred to as chance of precipitation or chance of rain, is a measure of the probability that at least some minimum quantity of precipitation will occur within a specified forecast period and location.
- **precipIntensity**:
    * The intensity of rainfall is a measure of the amount of rain that falls over time. The intensity of rain is measured in the height of the water layer covering the ground in a period of time. It means that if the rain stays where it falls, it would form a layer of a certain height.
 
##### Others
- **summary**:
    * Report generated by the data collection system (apparently!).
    * Including:
    ```
    Clear, Mostly Cloudy, Overcast, Partly Cloudy, Drizzle,
       Light Rain, Rain, Light Snow, Flurries, Breezy, Snow,
       Rain and Breezy, Foggy, Breezy and Mostly Cloudy,
       Breezy and Partly Cloudy, Flurries and Breezy, Dry,
       Heavy Snow.
    ```
- **icon**:
    * The icon that is used by the data collection system (apparently!).
    * Including:
    ```
    cloudy, clear-night, partly-cloudy-night, clear-day, partly-cloudy-day, rain, snow, wind, fog.
    ```
    

In [None]:
## pandas.read_csv: Read a comma-separated values (csv) file into DataFrame.
dataset = pd.read_csv("/kaggle/input/smart-home-dataset-with-weather-information/HomeC.csv", low_memory=False)
dataset.info()

In [None]:
tmp_str = "Feature(attribute)     DataType"; print(tmp_str+"\n"+"-"*len(tmp_str))
print(dataset.dtypes)

> To know more about Pandas DataFrame see this:
* https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96

In [None]:
## Return a tuple representing the dimensionality of the DataFrame.
print("Shape of the data: {} --> n_rows = {}, n_cols = {}".format(dataset.shape, dataset.shape[0],dataset.shape[1]))

In [None]:
## pandas.DataFrame.head: This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
dataset.head(10)

In [None]:
## This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.
dataset.tail(10)

> We see that the last row is invalid, so let's remove it.

In [None]:
dataset = dataset[0:-1] ## == dataset[0:dataset.shape[0]-1] == dataset[0:len(dataset)-1] == dataset[:-1]
dataset.tail()

> A NumPy trick: `numpy.r_` is a simple way to build up arrays quickly.

In [None]:
np.r_[0:5, -5:0]

In [None]:
dataset.iloc[np.r_[0:5, -5:0]]

In [None]:
## pandas.DataFrame.columns: The column labels of the DataFrame.
dataset.columns

> Let's clean the column names by removing the `[kW]` unit.

In [None]:
## Python string method replace() returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.
dataset.columns = [col.replace(' [kW]', '') for col in dataset.columns]
dataset.columns

In [None]:
dataset['Kitchen'] = dataset[['Kitchen 12','Kitchen 14','Kitchen 38']].mean(axis=1)
dataset = dataset.drop(['Kitchen 12','Kitchen 14','Kitchen 38'], axis=1)

dataset['Furnace'] = dataset[['Furnace 1','Furnace 2']].mean(axis=1)
dataset = dataset.drop(['Furnace 1','Furnace 2'], axis=1)

dataset.head(3)

### B. Indexing rows by `Time`

In [None]:
## Unix Time  (https://en.wikipedia.org/wiki/Unix_time)
## It represents the number of seconds that have passed since 00:00:00 UTC Thursday, 1 January 1970.
dataset['time'].head()

> This large number is a Unix timestamp (for example, `1284101485`), and we'd like to convert it to a readable date.

In [None]:
import time 
## gmtime interprets the timestamp as UTC, so the result does not depend on the machine's time zone.
print(' start ' , time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(1451624400)))
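
pandas can do the same conversion for a whole column at once. The sketch below assumes the timestamps are Unix times in seconds; the three sample values are made up for illustration, starting from the timestamp used in the cell above.

```python
import pandas as pd

# Hypothetical sample: three Unix timestamps (seconds), one minute apart.
unix_seconds = pd.Series([1451624400, 1451624460, 1451624520])

# unit='s' tells pandas the integers count seconds since 1970-01-01 00:00:00 UTC.
readable = pd.to_datetime(unix_seconds, unit='s')
print(readable)
```

The result is a timezone-naive datetime Series expressed in UTC.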

> The dataset contains readings taken at 1-minute intervals from a smart meter (house appliances, in kW) together with the weather conditions of that particular region.
So, we set `freq='min'` and convert Unix time to a readable date.

In [None]:
time_index = pd.date_range('2016-01-01 05:00', periods=len(dataset),  freq='min')  
time_index = pd.DatetimeIndex(time_index)
dataset = dataset.set_index(time_index)
dataset = dataset.drop(['time'], axis=1)
dataset.iloc[np.r_[0:5,-5:0]]

### C. ReSampling

In [None]:
dataset.shape

> We have 500K rows, and each row shows the home status at a specific `minute`.
Let's plot the `temperature` data and see the result.

In [None]:
dataset['temperature'].plot(figsize=(25,5))

> It may seem too noisy to you. We can `resample` data by taking the `average temperature` every `day` and then plot it.

In [None]:
## pandas.DataFrame.resample: Convenience method for frequency conversion and resampling of time series. 
dataset['temperature'].resample(rule='D').mean().plot(figsize=(25,5))

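> Here are the `rule`s you can use. Note that pandas 2.2 and later deprecate several of these aliases in favor of new spellings (for example `M` → `ME`, `Q` → `QE`, `A` → `YE`, `H` → `h`, `T` → `min`, `S` → `s`):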
- B         business day frequency
- C         custom business day frequency (experimental)
- D         calendar day frequency
- W         weekly frequency
- M         month end frequency
- SM        semi-month end frequency (15th and end of month)
- BM        business month end frequency
- CBM       custom business month end frequency
- MS        month start frequency
- SMS       semi-month start frequency (1st and 15th)
- BMS       business month start frequency
- CBMS      custom business month start frequency
- Q         quarter end frequency
- BQ        business quarter end frequency
- QS        quarter start frequency
- BQS       business quarter start frequency
- A         year end frequency
- BA, BY    business year end frequency
- AS, YS    year start frequency
- BAS, BYS  business year start frequency
- BH        business hour frequency
- H         hourly frequency
- T, min    minutely frequency
- S         secondly frequency
- L, ms     milliseconds
- U, us     microseconds
- N         nanoseconds
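
As a small, self-contained illustration of what resampling does, the sketch below builds a synthetic series of 1-minute readings (not the real sensor data) and downsamples it to daily means with the `D` rule.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor data: two days of 1-minute readings.
idx = pd.date_range('2016-01-01', periods=2 * 24 * 60, freq='min')
readings = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Downsample to one value per calendar day by averaging each day's 1440 readings.
daily_mean = readings.resample('D').mean()
print(daily_mean)
```

Other aggregations such as `.max()` or `.sum()` can be used in place of `.mean()` on the resampled object.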

### Data Cleaning

> First, we fix our desired figure size.

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (25,5)

> Second, we look at the dataset columns

In [None]:
dataset.columns

> It seems `use` and `House overall` show the same data.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1)
dataset['use'].resample('D').mean().plot(ax=axes[0])
dataset['House overall'].resample('D').mean().plot(ax=axes[1])
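
Two plots that look alike are only a hint. A numeric comparison is a more reliable check; the sketch below shows the idea on a toy frame whose column names `a`, `b` and `c` are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: column 'b' duplicates column 'a'.
df = pd.DataFrame({'a': [0.5, 1.2, 0.9], 'b': [0.5, 1.2, 0.9], 'c': [0.5, 1.0, 0.9]})

# np.allclose tolerates tiny floating-point differences between the two columns.
print(np.allclose(df['a'], df['b']))   # duplicate column
print(np.allclose(df['a'], df['c']))   # differs in the second row
```

On this notebook's data the same check would be `np.allclose(dataset['use'], dataset['House overall'])`; pass `equal_nan=True` if both columns may contain NaN in the same positions.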

> It's better to remove one of them.

In [None]:
dataset = dataset.drop(columns=['House overall'])
dataset.shape

> Columns `summary` and `icon` are not numerical. In this tutorial we do not need them. 

In [None]:
dataset = dataset.drop(columns=['summary', 'icon'])
dataset.shape

In [None]:
## pandas.Series.unique: Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
dataset['cloudCover'].unique()

> It seems for some rows we have an invalid value for `cloudCover`. 

In [None]:
dataset[dataset['cloudCover']=='cloudCover'].shape

> There are plenty of ways to deal with this kind of invalid value. The simplest is to remove the rows that contain it, but a more sophisticated way is to replace the value. See: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

In [None]:
dataset['cloudCover'][56:60]

> We replace these missing values with the next valid observation we have.

In [None]:
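## Series.replace(..., method='bfill') is deprecated in recent pandas, so we first turn the
## invalid 'cloudCover' strings into NaN and then back-fill from the next valid reading.
dataset['cloudCover'] = pd.to_numeric(dataset['cloudCover'], errors='coerce').bfill()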
dataset['cloudCover'].unique()

In [None]:
dataset['cloudCover'][56:60]
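
The clean-up idea (mark invalid entries as missing, then fill them from the next valid observation) can be seen in isolation on a toy column. The values below are invented for illustration.

```python
import pandas as pd

# Toy column: numeric readings stored as strings, with one stray header value.
raw = pd.Series(['0.75', 'cloudCover', '0.31', '1.0'])

# errors='coerce' turns anything that is not a number into NaN ...
numeric = pd.to_numeric(raw, errors='coerce')
# ... and bfill() replaces each NaN with the next valid observation.
cleaned = numeric.bfill()
print(cleaned.tolist())
```

Using `ffill()` instead would carry the previous valid observation forward.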

In [None]:
dataset.info()

> Now everything is numerical. From now on, for the sake of simplicity, let's work only on the `hourly` dataset.

In [None]:
dataset = dataset.resample('h').mean() ## 'h' is the hourly alias; uppercase 'H' is deprecated since pandas 2.2
print("Shape of the data: {} --> n_rows = {}, n_cols = {}".format(dataset.shape, dataset.shape[0], dataset.shape[1]))

### Visualization

> We want to see the Microwave usage pattern over a day (24 hours).

In [None]:
dataset['Microwave'].resample("h").mean().iloc[:24].plot()

> The above plot shows the usage for only one specific 24-hour window (the first in the dataset). What if we want the average across all days?

In [None]:
dataset.groupby(dataset.index.hour).mean()['Microwave'].plot(xticks=np.arange(24)).set(xlabel='Daily Hours', ylabel='Microwave Usage (kW)')

> Now we see a regular pattern around 11am-1pm and 4pm-6pm. However, there is some odd usage late at night!
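
To make the hour-of-day grouping concrete, here is a minimal sketch on synthetic data (three days of hourly values constructed so that the expected averages are easy to verify by hand).

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering three full days; each value is the hour of day
# plus the day of month, so the average for hour h should be h + 2.
idx = pd.date_range('2016-01-01', periods=3 * 24, freq=pd.Timedelta(hours=1))
values = np.asarray(idx.hour + idx.day, dtype=float)
usage = pd.Series(values, index=idx)

# Group all readings that share the same hour of day, then average them.
hourly_profile = usage.groupby(usage.index.hour).mean()
print(hourly_profile.head())
```

The result has one row per hour of day (0 to 23), regardless of how many days the series covers.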

## Moving Average

Let's start with a naive hypothesis: "tomorrow will be the same as today". However, instead of a model like $\hat{y}_{t} = y_{t-1}$ (which is actually a great baseline for any time series prediction problems and sometimes is impossible to beat), we will assume that the future value of our variable depends on the average of its $k$ previous values. Therefore, we will use the **moving average (MA)**.

$\hat{y}_{t} = \frac{1}{k} \displaystyle\sum^{k}_{n=1} y_{t-n}$
 > More info on MA: https://www.investopedia.com/terms/m/movingaverage.asp

> Pandas has an implementation available with [`DataFrame.rolling(window).mean()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html). The wider the window, the smoother the trend. In the case of very noisy data, which is often encountered in finance, this procedure can help detect common patterns.
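
A minimal sketch of the moving-average forecast from the formula above, on a toy series. Note that `rolling(k).mean()` at position $t$ includes $y_t$ itself, so it is shifted by one step to obtain a forecast that only uses the $k$ previous values.

```python
import pandas as pd

# Toy series; k is the window size from the formula above.
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
k = 3

# rolling(k).mean() at position t averages y[t-k+1..t]; shifting by one step
# turns it into the forecast for t built only from y[t-k..t-1].
y_hat = y.rolling(window=k).mean().shift(1)
print(y_hat.tolist())
```

The first `k` entries are NaN because there are not yet `k` previous values to average.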

In [None]:
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error

def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def plotMovingAverage(series, window, plot_intervals=False, scale=1.96, plot_anomalies=False):

    """
        series - dataframe with timeseries
        window - rolling window size 
        plot_intervals - show confidence intervals
        scale - number of standard deviations used for the interval width
        plot_anomalies - show anomalies 
    """
    
    rolling_mean = series.rolling(window=window).mean()

    plt.figure(figsize=(25,5))
    plt.title("Moving average with window size = {}".format(window))
    plt.plot(rolling_mean, "g", label="Rolling mean trend")

    # Plot confidence intervals for smoothed values
    if plot_intervals:
        mae = mean_absolute_error(series[window:], rolling_mean[window:])
        deviation = np.std(series[window:] - rolling_mean[window:])
        lower_bound = rolling_mean - (mae + scale * deviation)
        upper_bound = rolling_mean + (mae + scale * deviation)
        plt.plot(upper_bound, "r--", label="Upper Bound / Lower Bound")
        plt.plot(lower_bound, "r--")
        
        # Having the intervals, find abnormal values
        if plot_anomalies:
            anomalies = pd.DataFrame(index=series.index, columns=series.columns)
            anomalies[series<lower_bound] = series[series<lower_bound]
            anomalies[series>upper_bound] = series[series>upper_bound]
            plt.plot(anomalies, "ro", markersize=10)
        
    plt.plot(series[window:], label="Actual values")
    plt.legend(loc="upper left")
    plt.grid(True)

n_samples = 24*30 # 1 month
cols = ['use']
plotMovingAverage(dataset[cols][:n_samples], window=6) # A window of 6 hours

In [None]:
plotMovingAverage(dataset[cols][:n_samples], window=12) # A window of 12 hours

### Anomaly Detection 

The simplest way to detect anomalies in a time series is to treat the moving average as the trend of the data and to consider points that deviate strongly from it as anomalies.

In [None]:
plotMovingAverage(dataset[cols][:n_samples], window=24, plot_intervals=True, plot_anomalies=True)

> More info on Detecting Anomalies with Moving Average and Median Decomposition: https://anomaly.io/anomaly-detection-moving-median-decomposition/index.html
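
The same idea can be sketched without any plotting. The example below uses a synthetic signal with one injected spike; the band width (mean absolute error plus 1.96 standard deviations of the residuals) is one reasonable choice among many, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Synthetic signal: a smooth 24-step cycle with one injected spike.
t = np.arange(200)
signal = pd.Series(np.sin(2 * np.pi * t / 24))
signal.iloc[120] += 5.0   # hypothetical anomaly

window = 24
rolling_mean = signal.rolling(window=window).mean()

# Band half-width: mean absolute error plus 1.96 standard deviations of the residuals.
residuals = (signal - rolling_mean).dropna()
half_width = residuals.abs().mean() + 1.96 * residuals.std()

# A point is flagged when it falls outside rolling_mean +/- half_width.
is_anomaly = (signal - rolling_mean).abs() > half_width
print(signal[is_anomaly])
```

Positions where the rolling mean is still NaN (the first `window - 1` points) are never flagged, because comparisons with NaN evaluate to False.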

### Exponential smoothing

Now, let's see what happens if we start weighting all available observations while exponentially decreasing the weights as we move further back in time. There exists a formula for **[exponential smoothing](https://en.wikipedia.org/wiki/Exponential_smoothing)** that will help us with this:

$$\hat{y}_{t} = \alpha \cdot y_t + (1-\alpha) \cdot \hat y_{t-1} $$

Here the model value is a weighted average between the current true value and the previous model values. The $\alpha$ weight is called a smoothing factor. It defines how quickly we will "forget" the last available true observation. The smaller $\alpha$ is, the more influence the previous observations have and the smoother the series is.

Exponentiality is hidden in the recursiveness of the function -- we multiply by $(1-\alpha)$ each time, which already contains a multiplication by $(1-\alpha)$ of previous model values.
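
Unrolling the recursion makes the exponential weights explicit. Assuming the recursion is started with $\hat{y}_0 = y_0$ (an initialization choice, not something the formula itself fixes), repeated substitution gives:

$$\hat{y}_{t} = \alpha \sum_{i=0}^{t-1} (1-\alpha)^{i} \, y_{t-i} + (1-\alpha)^{t} \, y_{0}$$

An observation that lies $i$ steps in the past therefore enters with weight $\alpha (1-\alpha)^{i}$, which decays geometrically as $i$ grows.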

In [None]:
def exponential_smoothing(series, alpha):
    """
        series - dataset with timestamps
        alpha - float [0.0, 1.0], smoothing parameter
    """
    values = np.asarray(series, dtype=float) # positional access, independent of the index labels
    result = [values[0]] # first value is same as series
    for n in range(1, len(values)):
        result.append(alpha * values[n] + (1 - alpha) * result[n-1])
    return result

def plotExponentialSmoothing(series, alphas):
    """
        Plots exponential smoothing with different alphas
        
        series - dataset with timestamps
        alphas - list of floats, smoothing parameters
        
    """
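    # Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'; fall back to the old name otherwise.
    style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
    with plt.style.context(style):    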
        plt.figure(figsize=(25, 5))
        for alpha in alphas:
            plt.plot(exponential_smoothing(series, alpha), label="Alpha {}".format(alpha))
        plt.plot(series.values, "c", label = "Actual")
        plt.legend(loc="best")
        plt.axis('tight')
        plt.title("Exponential Smoothing")
        plt.grid(True);

In [None]:
n_samples = 24*30 # 1 month
col = 'use'
plotExponentialSmoothing(dataset[col][:n_samples], [0.3, 0.05])
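
pandas also ships an exponentially weighted mean. The sketch below checks, on a toy series, that `ewm(alpha=..., adjust=False).mean()` reproduces the recursive formula when the recursion starts from the first observation. The helper name `smooth` is illustrative only.

```python
import numpy as np
import pandas as pd

def smooth(values, alpha):
    """Plain-Python version of y_hat[t] = alpha*y[t] + (1-alpha)*y_hat[t-1]."""
    result = [values[0]]                      # start from the first observation
    for v in values[1:]:
        result.append(alpha * v + (1 - alpha) * result[-1])
    return np.array(result)

y = pd.Series([3.0, 5.0, 4.0, 8.0, 6.0])
alpha = 0.3

by_hand = smooth(y.to_numpy(), alpha)
# adjust=False selects the recursive form with the same starting value.
with_pandas = y.ewm(alpha=alpha, adjust=False).mean().to_numpy()

print(np.allclose(by_hand, with_pandas))
```

With the default `adjust=True`, pandas normalizes the weights differently, so the early values would not match the simple recursion.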

### Autoregressive Integrated Moving Average Model (ARIMA)
This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:

- AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
- I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.
- MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
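
The "Integrated" step is easy to see on a toy example. The series below has a pure linear trend (made-up values); first-order differencing removes the trend and leaves a constant series.

```python
import numpy as np
import pandas as pd

# Toy series with a linear trend: its mean keeps growing, so it is not stationary.
trend = pd.Series(2.0 * np.arange(10) + 1.0)

# First-order differencing (the "I" in ARIMA with d=1): y[t] - y[t-1].
diffed = trend.diff().dropna()
print(diffed.tolist())
```

The first element is dropped because it has no predecessor to subtract.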

In [None]:
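## Recent statsmodels releases (0.13 and later) no longer provide `statsmodels.tsa.arima_model`;
## the current ARIMA class lives in `statsmodels.tsa.arima.model`.
from statsmodels.tsa.arima.model import ARIMA
def forcast_ts(data, tt_ratio):
    X = data.values
    size = int(len(X) * tt_ratio)
    train, test = X[0:size], X[size:len(X)]
    history = [x for x in train]
    predictions = list()
    ## Walk-forward validation: refit on all observations seen so far, then forecast one step ahead.
    for t in range(len(test)):
        model = ARIMA(history, order=(5,1,0))
        model_fit = model.fit() ## the current API has no `disp` argument
        output = model_fit.forecast()
        yhat = output[0]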
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)
        print('progress:%',round(100*(t/len(test))),'\t predicted=%f, expected=%f' % (yhat, obs), end="\r")
    error = mean_squared_error(test, predictions)
    print('\n Test MSE: %.3f' % error)

    plt.rcParams["figure.figsize"] = (25,10)
    preds = np.append(train, predictions)
    plt.plot(list(preds), color='green', linewidth=3, label="Predicted Data")
    plt.plot(list(data), color='blue', linewidth=2, label="Original Data")
    plt.axvline(x=int(len(data)*tt_ratio)-1, linewidth=5, color='red')
    plt.legend()
    plt.show()

In [None]:
col = 'use'
data = dataset[col].resample('W').mean()
data.shape
tt_ratio = 0.70 # Train to Test ratio
forcast_ts(data, tt_ratio)

In [None]:
col = 'use'
data = dataset[col].resample('D').mean()
data.shape
tt_ratio = 0.70 # Train to Test ratio
forcast_ts(data, tt_ratio)

## A note on time series cross validation

Before we start building a model, let's first discuss how to estimate model parameters automatically.

There is nothing unusual here; as always, we have to choose a loss function suitable for the task that will tell us how closely the model approximates the data. Then, using cross-validation, we will evaluate our chosen loss function for the given model parameters, calculate the gradient, adjust the model parameters, and so on, eventually descending to the global minimum.

You may be asking how to do cross-validation for time series, because time series have a temporal structure and one cannot randomly mix values in a fold while preserving this structure. With randomization, all time dependencies between observations will be lost. This is why we will have to use a trickier approach in optimizing the model parameters. I don't know if there's an official name for this, but on [CrossValidated](https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection), where one can find all answers but the Answer to the Ultimate Question of Life, the Universe, and Everything, the proposed name for this method is "cross-validation on a rolling basis".

The idea is rather simple -- we train our model on a small segment of the time series from the beginning until some $t$, make predictions for the next $n$ steps (from $t$ to $t+n$), and calculate an error. Then, we expand our training sample to $t+n$, make predictions from $t+n$ until $t+2n$, and continue moving our test segment of the time series until we hit the last available observation. As a result, we have as many folds as $n$ will fit between the initial training sample and the last observation.

<img src="https://habrastorage.org/files/f5c/7cd/b39/f5c7cdb39ccd4ba68378ca232d20d864.png"/>
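
The splitting scheme described above can be sketched as a small generator. The function and parameter names (`rolling_splits`, `n_obs`, `initial_train`, `n`) are illustrative assumptions, not part of any library.

```python
def rolling_splits(n_obs, initial_train, n):
    """Yield (train_indices, test_indices) for cross-validation on a rolling basis.

    n_obs, initial_train and n are illustrative parameter names:
    series length, size of the first training segment, and forecast horizon.
    """
    t = initial_train
    while t + n <= n_obs:
        yield list(range(0, t)), list(range(t, t + n))
        t += n   # grow the training window by the segment just tested

for train_idx, test_idx in rolling_splits(n_obs=10, initial_train=4, n=2):
    print(len(train_idx), test_idx)
```

Each fold trains only on observations that precede its test segment, so no future information leaks into training. scikit-learn's `TimeSeriesSplit` provides a similar expanding-window splitter.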


## Credits and Further Reading
 1. https://mlcourse.ai/articles/topic1-exploratory-data-analysis-with-pandas/
 2. https://mlcourse.ai/articles/topic2-visual-data-analysis-in-python/
 3. https://mlcourse.ai/articles/topic9-part1-time-series/
 4. https://mlcourse.ai/articles/topic9-part2-prophet/
 5. https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/