<h2 style="text-align:center;font-size:200%;">IoT Temperature Forecasting</h2>
<h3  style="text-align:center;">Keywords : <span class="label label-success">IoT</span> <span class="label label-success">Time Series Analysis</span> <span class="label label-success">Pre-processing</span> <span class="label label-success">EDA</span> <span class="label label-success">Bayesian Modeling</span></h3>

# Table of Contents<a id='top'></a>

>1. [Overview](#1.-Overview)  
>   * [Project Detail](#Project-Detail)
>   * [Goal of this notebook](#Goal-of-this-notebook)
>1. [Import libraries](#2.-Import-libraries)
>1. [Load the dataset](#3.-Load-the-dataset)
>1. [Pre-processing](#4.-Pre-processing)
>   * [Datetime information](#Datetime-information)
>   * [Seasonal information](#Seasonal-information)
>   * [Timing information](#Timing-information)
>   * [Unique identifier defined by id](#Unique-identifier-defined-by-id)
>1. [EDA](#5.-EDA)  
>   * [Univariate Analysis](#Univariate-Analysis)
>   * [Multivariate Analysis](#Multivariate-Analysis)
>   * [Time Series Analysis](#Time-Series-Analysis)
>1. [Modelling](#6.-Modelling)
>    * [Data Preparation](#Data-Preparation)
>    * [Build Model & Predict Future Temperature](#Build-Model-&-Predict-Future-Temperature)
>1. [Conclusion](#7.-Conclusion)
>    * [Task Submission](#Task-Submission)
>1. [References](#8.-References)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 1. Overview
## Project Detail
><p>In <a href='https://www.kaggle.com/atulanandjha/temperature-readings-iot-devices'>this dataset</a>, we have temperature readings from IoT devices installed outside and inside an anonymous room. Because the device was in its testing phase, it was uninstalled or shut off several times during the reading period, which caused some outliers and missing values.</p><br/>
>Dataset details:
><ul>
>    <li><b>id</b> : unique IDs for each reading</li>
>    <li><b>room_id/id</b> : ID of the room in which the device was installed (currently only 'admin room', for example purposes)</li>
>    <li><b>noted_date</b> : date and time of reading</li>
>    <li><b>temp</b> : temperature readings</li>
>    <li><b>out/in</b> : whether reading was taken from device installed inside or outside of room</li>
></ul>
>With this data, we can explore the following:
><ul>
>    <li>the relationship between inside and outside temperature</li>
>    <li>trend or seasonality in the data</li>
>    <li>forecasting future temperature with time-series modeling</li>
>    <li>characteristic tendencies by year, month, week, or day/night</li>
>    <li>and so on...</li>
></ul>

## Goal of this notebook
>* Practice data cleansing techniques
>* Practice EDA techniques for time-series data
>    * Series decomposition into trend/seasonality
>* Practice visualisation techniques
>* Practice time-series modeling techniques
>    * Prophet

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 2. Import libraries

In [None]:
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import os
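# The package was renamed from `fbprophet` to `prophet` in v1.0; try the new name first.
try:
    from prophet import Prophet
    from prophet.plot import add_changepoints_to_plot
except ImportError:
    from fbprophet import Prophet
    from fbprophet.plot import add_changepoints_to_plot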

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 3. Load the dataset

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/temperature-readings-iot-devices/IOT-temp.csv")
print(f'IOT-temp.csv : {df.shape}')
df.head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 4. Pre-processing

>The column 'room_id/id' has only one value (Room Admin), so we don't need it for analysis.

In [None]:
df['room_id/id'].value_counts()

In [None]:
df.drop('room_id/id', axis=1, inplace=True)
df.head(3)

>Rename columns to make them easier to understand.

In [None]:
df.rename(columns={'noted_date':'date', 'out/in':'place'}, inplace=True)
df.head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Datetime information
>The datetime column carries a lot of information, such as year, month, and weekday. To use this information in the EDA and modeling phases, we need to extract it from the datetime column.

In [None]:
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M')
# Vectorized .dt accessors replace the row-wise apply calls.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df.head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Seasonal information
><div class="alert alert-success" role="alert">
>Let's assume this data was collected in India.<br/>
>According to <a href='https://en.wikipedia.org/wiki/Climate_of_India#Seasons'>this wiki page</a>, India has four climatological seasons as below.
><ul>
>    <li><b>Winter</b> : December to February</li>
>    <li><b>Summer</b> : March to May</li>
>    <li><b>Monsoon</b> : June to September</li>
>    <li><b>Post-monsoon</b> : October to November</li>
></ul>
>We can create a seasonal variable based on the month variable.<br/>
><u>The idea came from <a href='https://www.kaggle.com/satishkundanagar/temp-reading-iot-devices-eda'>this notebook.</a></u>
></div>

>Function to convert the month variable into a season.

In [None]:
def month2seasons(x):
    if x in [12, 1, 2]:
        season = 'Winter'
    elif x in [3, 4, 5]:
        season = 'Summer'
    elif x in [6, 7, 8, 9]:
        season = 'Monsoon'
    elif x in [10, 11]:
        season = 'Post_Monsoon'
    return season

In [None]:
df['season'] = df['month'].apply(month2seasons)
df.head(3)
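
>As an alternative to the if/elif chain above, the same month-to-season rule can be sketched as a dictionary lookup applied with `Series.map`. The name `SEASON_BY_MONTH` and the sample months are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Month-to-season lookup following the four seasons listed above.
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Summer', 4: 'Summer', 5: 'Summer',
    6: 'Monsoon', 7: 'Monsoon', 8: 'Monsoon', 9: 'Monsoon',
    10: 'Post_Monsoon', 11: 'Post_Monsoon',
}

# Hypothetical sample months, not taken from the dataset.
months = pd.Series([1, 4, 7, 10, 12])
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

>A lookup table keeps the rule in one place, and a month outside 1-12 maps to NaN instead of raising an error inside the function.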

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Timing information
><div class="alert alert-success" role="alert">
>The hour variable can be grouped into Night, Morning, Afternoon, and Evening based on its value.
><ul>
>    <li><b>Night</b> : 22:00 - 23:59 / 00:00 - 03:59</li>
>    <li><b>Morning</b> : 04:00 - 11:59</li>
>    <li><b>Afternoon</b> : 12:00 - 16:59</li>
>    <li><b>Evening</b> : 17:00 - 21:59</li>
></ul>
>We can create a timing variable based on the hour variable.<br/>
><u>The idea came from <a href='https://www.kaggle.com/satishkundanagar/temp-reading-iot-devices-eda'>this notebook.</a></u>
></div>

In [None]:
def hours2timing(x):
    if x in [22,23,0,1,2,3]:
        timing = 'Night'
    elif x in range(4, 12):
        timing = 'Morning'
    elif x in range(12, 17):
        timing = 'Afternoon'
    elif x in range(17, 22):
        timing = 'Evening'
    else:
        timing = 'X'
    return timing

In [None]:
df['timing'] = df['hour'].apply(hours2timing)
df.head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Unique identifier defined by id
><div class="alert alert-success" role="alert">
>The 'id' column seems to carry some information related to the 'date' column. <br/>
>The 'date' column has no seconds, so 'id' may encode seconds or some other marker of when each record was collected.<br/>
><u>The idea came from <a href='https://www.kaggle.com/satishkundanagar/temp-reading-iot-devices-eda'>this notebook.</a></u>
></div>

### Duplication
>A check for duplicates shows that some records are repeated, so we need to collapse each set of duplicates into a single record.

In [None]:
df[df.duplicated()]

In [None]:
df[df['id']=='__export__.temp_log_196108_4a983c7e']

In [None]:
df.drop_duplicates(inplace=True)
df[df.duplicated()]

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

### Uniqueness of id
><div class="alert alert-success" role="alert">
>On closer inspection, the 'id' column appears to hold unique values made of two separable components, one numeric and one alpha-numeric.<br/>
>In the case of 'id' of <b>'__export__.temp_log_101144_ff2f0b97'</b>, it can be decomposed into two parts.
><ul>
>    <li><b>numeric part</b> : 101144</li>
>    <li><b>alpha-numeric part</b> : ff2f0b97</li>
></ul>
>The alpha-numeric part is hard to interpret, but <u>the numeric part may indicate the <b>uniqueness</b> or <b>sortability</b> of each record, for example seconds information</u>.
></div>

>At the same datetime (<b>2018-09-12 03:09:00</b>), there are many records, each with its own id.

In [None]:
df.loc[df['date']=='2018-09-12 03:09:00', ].sort_values(by='id').head(5)

>The number of distinct numeric parts in 'id' equals the number of rows, so the numeric part uniquely identifies each record.

In [None]:
df['id'].apply(lambda x : x.split('_')[6]).nunique() == len(df)

>Replace 'id' with its numeric part and use that as the new identifier.

In [None]:
df['id'] = df['id'].apply(lambda x : int(x.split('_')[6]))
df.head(3)
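
>The split on `'_'` above depends on the position of the numeric part within the string. A minimal sketch of an alternative, assuming every id follows the `temp_log_<digits>_<suffix>` pattern seen in the two ids quoted in this notebook, uses a regular expression with named groups. The group names `num` and `suffix` are chosen here for illustration.

```python
import pandas as pd

# Two ids quoted in this notebook.
ids = pd.Series([
    '__export__.temp_log_101144_ff2f0b97',
    '__export__.temp_log_196108_4a983c7e',
])

# Capture the numeric part and the trailing alpha-numeric part by name.
parts = ids.str.extract(r'temp_log_(?P<num>\d+)_(?P<suffix>[0-9a-z]+)$')
parts['num'] = parts['num'].astype(int)
print(parts)
```

>Ids that do not match the pattern would come back as NaN, which makes malformed values easy to spot before the integer conversion.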

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

### Gaps in id
><div class="alert alert-success" role="alert">
>Selecting a single datetime (<b>2018-09-12 03:09:00</b>) and sorting by 'id' reveals <u>some gaps</u> in the 'id' column.<br/>
>This makes it a little difficult to understand how 'id' maps to 'date'.
><ul>
>    <li><b>17003 - 17006</b> : 17004 and 17005 missing</li>
>    <li><b>17006 - 17009</b> : 17007 and 17008 missing</li>
></ul>
>In addition, selecting a range of 'id' values (<b>4000-4010</b>) and sorting by 'id' shows that <u>the 'date' of 'id' 4004 is out of step with the others</u>.<br/>
>If 'id' encoded time, 'date' would be in order when sorted by 'id'. But <u>'id' 4004 has an earlier datetime than the preceding 'id'</u>.
><ul>
>    <li><b>4002</b> : 2018-09-09 16:<font color='red'>24</font>:00</li>
>    <li><b>4004</b> : 2018-09-09 16:<font color='red'>23</font>:00</li>
></ul>
>So we can say that <u>the 'id' column does not encode seconds information, but it can be used as <b>a unique identifier</b> for each record</u>.
></div>

>There are gaps in 'id' column.

In [None]:
df.loc[df['date'] == '2018-09-12 03:09:00', ].sort_values(by ='id').head(5)

>The 'date' column is not in chronological order when sorted by 'id'.

In [None]:
df.loc[df['id'].isin(range(4000, 4011))].sort_values(by='id')
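
>Both checks described above can also be done programmatically instead of by eye. The sketch below uses a small hypothetical frame; only ids 4002 and 4004 mirror the timestamps quoted in this section, and the other rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only ids 4002 and 4004 mirror the timestamps quoted above.
sample = pd.DataFrame({
    'id': [4001, 4002, 4004, 4005],
    'date': pd.to_datetime([
        '2018-09-09 16:24', '2018-09-09 16:24',
        '2018-09-09 16:23', '2018-09-09 16:24',
    ]),
}).sort_values('id')

# A step larger than 1 between consecutive ids marks a gap in the numbering.
id_gaps = sample.loc[sample['id'].diff() > 1, 'id'].tolist()

# A negative time difference means the timestamp moved backwards as the id grew.
out_of_order = sample.loc[sample['date'].diff() < pd.Timedelta(0), 'id'].tolist()

print(id_gaps, out_of_order, sample['date'].is_monotonic_increasing)
```

>On the real data, the same two `diff()` checks could be applied to `df.sort_values('id')` to list every id that follows a gap or carries an earlier timestamp than its predecessor.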

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 5. EDA

## Univariate Analysis

### Monthly Readings

In [None]:
month_rd = np.round(df['date'].apply(lambda x : x.strftime("%Y-%m")).value_counts(normalize=True).sort_index() * 100,decimals=1)
month_rd_bar = hv.Bars(month_rd).opts(color="green")
month_rd_curve = hv.Curve(month_rd).opts(color="red")
(month_rd_bar * month_rd_curve).opts(title="Monthly Readings Count", xlabel="Month", ylabel="Percentage", yformatter='%d%%', width=700, height=300,tools=['hover'],show_grid=True)

### Temperature
>The temperature distribution is clearly multimodal, which suggests it is a mixture of several distributions.

In [None]:
hv.Distribution(df['temp']).opts(title="Temperature Distribution", color="green", xlabel="Temperature", ylabel="Density")\
                            .opts(opts.Distribution(width=700, height=300,tools=['hover'],show_grid=True))

### Place

In [None]:
pl_cnt = np.round(df['place'].value_counts(normalize=True) * 100)
hv.Bars(pl_cnt).opts(title="Readings Place Count", color="green", xlabel="Places", ylabel="Percentage", yformatter='%d%%')\
                .opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))

### Season

In [None]:
season_cnt = np.round(df['season'].value_counts(normalize=True) * 100)
hv.Bars(season_cnt).opts(title="Season Count", color="green", xlabel="Season", ylabel="Percentage", yformatter='%d%%')\
                .opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))

### Timing

In [None]:
timing_cnt = np.round(df['timing'].value_counts(normalize=True) * 100)
hv.Bars(timing_cnt).opts(title="Timing Count", color="green", xlabel="Timing", ylabel="Percentage", yformatter='%d%%')\
                .opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Multivariate Analysis

### Monthly Readings by Place

In [None]:
in_month = np.round(df[df['place']=='In']['date'].apply(lambda x : x.strftime("%Y-%m")).value_counts(normalize=True).sort_index() * 100, decimals=1)
out_month = np.round(df[df['place']=='Out']['date'].apply(lambda x : x.strftime("%Y-%m")).value_counts(normalize=True).sort_index() * 100, decimals=1)
# Name the columns explicitly instead of relying on merge suffixes, which depend on the pandas version.
in_out_month = pd.DataFrame({'In': in_month, 'Out': out_month}).dropna().rename_axis('Month').reset_index()
in_out_month = pd.melt(in_out_month, id_vars=['Month'], var_name='Place')
# The values are percentages of readings per place, so label the axis accordingly.
hv.Bars(in_out_month, ['Month', 'Place'], 'value').opts(opts.Bars(title="Monthly Readings by Place Count", width=700, height=400,tools=['hover'],show_grid=True, ylabel="Percentage"))

### Temperature Distribution by Place
><div class="alert alert-success" role="alert">
><ul>
><li>The inside temperature follows a single distribution, while <u>the outside temperature is a mixture of multiple distributions.</u></li>
><li>The inside temperature appears to be kept roughly constant by the air conditioner, whereas <u>the outside temperature is easily affected by time-series factors such as seasons.</u></li>
></ul>
></div>

In [None]:
(hv.Distribution(df[df['place']=='In']['temp'], label='In') * hv.Distribution(df[df['place']=='Out']['temp'], label='Out'))\
                                .opts(title="Temperature by Place Distribution", xlabel="Temperature", ylabel="Density")\
                                .opts(opts.Distribution(width=700, height=300,tools=['hover'],show_grid=True))

### Temperature by Season

In [None]:
season_agg = df.groupby('season').agg({'temp': ['min', 'max']})
season_maxmin = pd.merge(season_agg['temp']['max'],season_agg['temp']['min'],right_index=True,left_index=True)
season_maxmin = pd.melt(season_maxmin.reset_index(), ['season']).rename(columns={'season':'Season', 'variable':'Max/Min'})
hv.Bars(season_maxmin, ['Season', 'Max/Min'], 'value').opts(title="Temperature by Season Max/Min", ylabel="Temperature")\
                                                                    .opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))

### Temperature by Timing

In [None]:
timing_agg = df.groupby('timing').agg({'temp': ['min', 'max']})
timing_maxmin = pd.merge(timing_agg['temp']['max'],timing_agg['temp']['min'],right_index=True,left_index=True)
timing_maxmin = pd.melt(timing_maxmin.reset_index(), ['timing']).rename(columns={'timing':'Timing', 'variable':'Max/Min'})
hv.Bars(timing_maxmin, ['Timing', 'Max/Min'], 'value').opts(title="Temperature by Timing Max/Min", ylabel="Temperature")\
                                                                    .opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Time Series Analysis

### Pre-processing for time-series analysis
>Time-series analysis is easier with a unique time index. So we need to calculate the mean temperature for each combination of 'date' and 'place', and drop the 'id' column.

In [None]:
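# Sort by both keys so the row order matches the (date, place) order of the groupby result assigned below.
tsdf = df.drop_duplicates(subset=['date','place']).sort_values(['date','place']).reset_index(drop=True)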
tsdf['temp'] = df.groupby(['date','place'])['temp'].mean().values
tsdf.drop('id', axis=1, inplace=True)
tsdf.head(3)
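
>The aggregation step can also be sketched on its own. The example below, on hypothetical readings, computes one mean per ('date', 'place') pair with a single `groupby`; the frame `raw` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical readings: two share the same minute and place.
raw = pd.DataFrame({
    'date': pd.to_datetime(['2018-09-12 03:09', '2018-09-12 03:09',
                            '2018-09-12 03:09', '2018-09-12 03:10']),
    'place': ['In', 'In', 'Out', 'In'],
    'temp': [30, 32, 40, 31],
})

# as_index=False keeps 'date' and 'place' as ordinary columns next to the mean.
mean_df = raw.groupby(['date', 'place'], as_index=False)['temp'].mean()
print(mean_df)
```

>Because the keys and the mean come out of the same operation, there is no need to align two separately ordered results.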

### Monthly Temperature Mean
><div class="alert alert-success" role="alert">
><ul>
><li>The <b>outside</b> temperature varies more over time than the <b>inside</b> temperature.</li>
><li>The inside temperature is presumably regulated by an air conditioner, whereas <u>the outside temperature is affected by seasonal temperature fluctuations.</u></li>
></ul>
></div>

In [None]:
in_month = tsdf[tsdf['place']=='In'].groupby('month').agg({'temp':['mean']})
in_month.columns = [f"{i[0]}_{i[1]}" for i in in_month.columns]
out_month = tsdf[tsdf['place']=='Out'].groupby('month').agg({'temp':['mean']})
out_month.columns = [f"{i[0]}_{i[1]}" for i in out_month.columns]
# Parenthesize the overlay so the options apply to the combined plot, not only to the 'Out' curve.
(hv.Curve(in_month, label='In') * hv.Curve(out_month, label='Out')).opts(title="Monthly Temperature Mean", ylabel="Temperature", xlabel='Month')\
                                                                    .opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))

### Daily Temperature Mean

In [None]:
tsdf['daily'] = tsdf['date'].dt.normalize()  # truncate each timestamp to midnight of its day
in_day = tsdf[tsdf['place']=='In'].groupby(['daily']).agg({'temp':['mean']})
in_day.columns = [f"{i[0]}_{i[1]}" for i in in_day.columns]
out_day = tsdf[tsdf['place']=='Out'].groupby(['daily']).agg({'temp':['mean']})
out_day.columns = [f"{i[0]}_{i[1]}" for i in out_day.columns]
(hv.Curve(in_day, label='In') * hv.Curve(out_day, label='Out')).opts(title="Daily Temperature Mean", ylabel="Temperature", xlabel='Day', shared_axes=False)\
                                                                    .opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))

### Weekday Temperature Mean

In [None]:
in_wd = tsdf[tsdf['place']=='In'].groupby('weekday').agg({'temp':['mean']})
in_wd.columns = [f"{i[0]}_{i[1]}" for i in in_wd.columns]
in_wd['week_num'] = [['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'].index(i) for i in in_wd.index]
in_wd.sort_values('week_num', inplace=True)
in_wd.drop('week_num', axis=1, inplace=True)
out_wd = tsdf[tsdf['place']=='Out'].groupby('weekday').agg({'temp':['mean']})
out_wd.columns = [f"{i[0]}_{i[1]}" for i in out_wd.columns]
out_wd['week_num'] = [['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'].index(i) for i in out_wd.index]
out_wd.sort_values('week_num', inplace=True)
out_wd.drop('week_num', axis=1, inplace=True)
# Parenthesize the overlay so the options apply to the combined plot, not only to the 'Out' curve.
(hv.Curve(in_wd, label='In') * hv.Curve(out_wd, label='Out')).opts(title="Weekday Temperature Mean", ylabel="Temperature", xlabel='Weekday')\
                                                                    .opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))

### WeekofYear Temperature Mean

In [None]:
in_wof = tsdf[tsdf['place']=='In'].groupby('weekofyear').agg({'temp':['mean']})
in_wof.columns = [f"{i[0]}_{i[1]}" for i in in_wof.columns]
out_wof = tsdf[tsdf['place']=='Out'].groupby('weekofyear').agg({'temp':['mean']})
out_wof.columns = [f"{i[0]}_{i[1]}" for i in out_wof.columns]
# Parenthesize the overlay so the options apply to the combined plot, not only to the 'Out' curve.
(hv.Curve(in_wof, label='In') * hv.Curve(out_wof, label='Out')).opts(title="WeekofYear Temperature Mean", ylabel="Temperature", xlabel='WeekofYear')\
                                                                    .opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

### Missing data
><div class="alert alert-success" role="alert">
><ul>
><li>Plotting the full series shows missing data points scattered throughout the whole period.</li>
><li>Interpolating with the <b>'nearest'</b> method improves the picture (though it is far from ideal), and many gaps remain in the interpolated data.</li>
></ul>
></div>

In [None]:
in_tsdf = tsdf[tsdf['place']=='In'].reset_index(drop=True)
in_tsdf.index = in_tsdf['date']
in_all = hv.Curve(in_tsdf['temp']).opts(title="[In] Temperature All", ylabel="Temperature", xlabel='Time', color='red')

out_tsdf = tsdf[tsdf['place']=='Out'].reset_index(drop=True)
out_tsdf.index = out_tsdf['date']
out_all = hv.Curve(out_tsdf['temp']).opts(title="[Out] Temperature All", ylabel="Temperature", xlabel='Time', color='blue')

in_tsdf_int = in_tsdf['temp'].resample('1min').interpolate(method='nearest')
in_tsdf_int_all = hv.Curve(in_tsdf_int).opts(title="[In] Temperature All Interpolated with 'nearest'", ylabel="Temperature", xlabel='Time', color='red', fontsize={'title':11})
out_tsdf_int = out_tsdf['temp'].resample('1min').interpolate(method='nearest')
out_tsdf_int_all = hv.Curve(out_tsdf_int).opts(title="[Out] Temperature All Interpolated with 'nearest'", ylabel="Temperature", xlabel='Time', color='blue', fontsize={'title':11})

(in_all + in_tsdf_int_all + out_all + out_tsdf_int_all).opts(opts.Curve(width=400, height=300,tools=['hover'],show_grid=True)).opts(shared_axes=False).cols(2)
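
>Before interpolating, it can help to quantify how much of the minute-level grid is actually missing. The sketch below does this on a hypothetical three-point series; the values and timestamps are invented for illustration.

```python
import pandas as pd

# Hypothetical minute-level readings with a 3-minute hole between 00:01 and 00:05.
s = pd.Series(
    [30.0, 31.0, 33.0],
    index=pd.to_datetime(['2018-09-12 00:00', '2018-09-12 00:01', '2018-09-12 00:05']),
)

# asfreq puts the series on a regular 1-minute grid; unobserved minutes become NaN.
grid = s.asfreq('1min')
missing_share = grid.isna().mean()
print(len(grid), int(grid.isna().sum()), missing_share)
```

>Applied to `in_tsdf['temp']` and `out_tsdf['temp']`, the same two lines would give the share of minutes without a reading for each place.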

><div class="alert alert-success" role="alert">
><ul>
><li>To forecast future temperature, we need to aggregate the data to a coarser granularity.</li>
><li>Using interpolated daily mean data looks like a good solution.</li>
></ul>
></div>

In [None]:
in_d_org = hv.Curve(in_day).opts(title="[In] Daily Temperature Mean", ylabel="Temperature", xlabel='Time', color='red')
out_d_org = hv.Curve(out_day).opts(title="[Out] Daily Temperature Mean", ylabel="Temperature", xlabel='Time', color='blue')

inp_df = pd.DataFrame()
in_d_inp = in_day.resample('1D').interpolate('spline', order=5)
out_d_inp = out_day.resample('1D').interpolate('spline', order=5)
inp_df['In'] = in_d_inp.temp_mean
inp_df['Out'] = out_d_inp.temp_mean

in_d_inp_g = hv.Curve(inp_df['In']).opts(title="[In] Daily Temperature Mean Interpolated with 'spline'", ylabel="Temperature", xlabel='Time', color='red', fontsize={'title':10})
out_d_inp_g = hv.Curve(inp_df['Out']).opts(title="[Out] Daily Temperature Mean Interpolated with 'spline'", ylabel="Temperature", xlabel='Time', color='blue', fontsize={'title':10})

(in_d_org + in_d_inp_g + out_d_org + out_d_inp_g).opts(opts.Curve(width=400, height=300,tools=['hover'],show_grid=True)).opts(shared_axes=False).cols(2)
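
>For comparison with the order-5 spline used above, here is a minimal sketch of the same resample-then-interpolate step using linear interpolation instead. This is a swapped-in technique, not the method of this notebook, and the two daily means are hypothetical. Linear interpolation never leaves the range of the neighbouring observations, whereas a high-order spline may oscillate across long gaps.

```python
import pandas as pd

# Hypothetical daily means with two missing days in between.
daily = pd.Series(
    [30.0, 36.0],
    index=pd.to_datetime(['2018-09-01', '2018-09-04']),
)

# Resample to a daily grid, then fill the gaps on a straight line between neighbours.
filled = daily.resample('1D').mean().interpolate(method='linear')
print(filled.tolist())
```

>Plotting both variants side by side on `in_day` and `out_day` would show whether the spline introduces values that the observed data do not support.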

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 6. Modelling
><div class="alert alert-success" role="alert">
>We build a time-series model with Prophet to predict future temperature inside and outside the room.<br/>
>I chose Prophet as the time-series modeling tool for the following reasons.
><ul>
><li>Automatic detection of trend and seasonality</li>
><li>Robustness against outliers</li>
><li>Customizable seasonalities</li>
><li>No need for fine parameter tuning</li>
></ul>
></div>

## Data Preparation
>In addition to temperature, I added season information, a time-series factor that affects temperature (especially outside).

In [None]:
org_df = inp_df.reset_index()
org_df['season'] = org_df['daily'].apply(lambda x : month2seasons(x.month))
org_df = pd.get_dummies(org_df, columns=['season'])
org_df.head(3)
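
>`pd.get_dummies` only creates columns for seasons that occur in the data, so a season that is absent from the date range has no column at all. The sketch below, on hypothetical dates, shows one way to guarantee a fixed set of boolean columns with `reindex`. The list `SEASONS` and the sample dates are assumptions for illustration, and the columns here carry no `season_` prefix.

```python
import pandas as pd

SEASONS = ['Winter', 'Summer', 'Monsoon', 'Post_Monsoon']

# Hypothetical dates covering only two of the four seasons.
dates = pd.Series(pd.to_datetime(['2018-09-30', '2018-10-01', '2018-11-15']))
season = dates.dt.month.map({12: 'Winter', 1: 'Winter', 2: 'Winter',
                             3: 'Summer', 4: 'Summer', 5: 'Summer',
                             6: 'Monsoon', 7: 'Monsoon', 8: 'Monsoon', 9: 'Monsoon',
                             10: 'Post_Monsoon', 11: 'Post_Monsoon'})

# reindex guarantees one column per season, filling absent seasons with False.
dummies = (pd.get_dummies(season)
             .reindex(columns=SEASONS, fill_value=False)
             .astype(bool))
print(dummies)
```

>With a fixed column set, code that later looks up a season column by name does not fail when the training or forecast window happens to skip that season.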

## Build Model & Predict Future Temperature

In [None]:
def run_prophet(place, prediction_periods, plot_comp=True):
    # make dataframe for training
    prophet_df = pd.DataFrame()
    # use the interpolated daily dates directly instead of a hard-coded last row index (133)
    prophet_df["ds"] = org_df['daily']
    prophet_df['y'] = org_df[place]
    # add seasonal information
    prophet_df['monsoon'] = org_df['season_Monsoon']
    prophet_df['post_monsoon'] = org_df['season_Post_Monsoon']
    prophet_df['winter'] = org_df['season_Winter']

    # train model by Prophet
    m = Prophet(changepoint_prior_scale=0.1, yearly_seasonality=2, weekly_seasonality=False)
    # include seasonal periodicity into the model
    m.add_seasonality(name='season_monsoon', period=124, fourier_order=5, prior_scale=0.1, condition_name='monsoon')
    m.add_seasonality(name='season_post_monsoon', period=62, fourier_order=5, prior_scale=0.1, condition_name='post_monsoon')
    m.add_seasonality(name='season_winter', period=93, fourier_order=5, prior_scale=0.1, condition_name='winter')
    m.fit(prophet_df)

    # make dataframe for prediction
    future = m.make_future_dataframe(periods=prediction_periods)
    # add seasonal information
    future_season = pd.get_dummies(future['ds'].apply(lambda x : month2seasons(x.month)))
    future['monsoon'] = future_season['Monsoon']
    future['post_monsoon'] = future_season['Post_Monsoon']
    future['winter'] = future_season['Winter']

    # predict the future temperature
    prophe_result = m.predict(future)

    # plot prediction
    fig1 = m.plot(prophe_result)
    ax = fig1.gca()
    ax.set_title(f"{place} Prediction", size=25)
    ax.set_xlabel("Time", size=15)
    ax.set_ylabel("Temperature", size=15)
    a = add_changepoints_to_plot(ax, m, prophe_result)
    fig1.show()
    # plot decomposed time-series components
    if plot_comp:
        fig2 = m.plot_components(prophe_result)
        fig2.show()

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

In [None]:
run_prophet('In',30)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

In [None]:
run_prophet('Out',30)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 7. Conclusion
><div class="alert alert-success" role="alert">
><ul>
><li>The 'id' column did not carry additional information such as seconds, but it served as a unique identifier for each row.</li>
><li>We extracted useful features such as <b>seasonal information and timing information</b> from the datetime column.</li>
><li>The inside temperature follows a single distribution, <u>while the outside temperature is a mixture of multiple distributions.</u></li>
><li>The outside temperature is <u>more affected by seasonal temperature fluctuations</u> than the inside temperature.</li>
><li>The many gaps in the data made it difficult to build a model, so we interpolated the daily-mean data with the 'spline' method.</li>
><li>Some outliers also made it difficult to build a forecasting model, but Prophet's robustness to outliers is thought to have helped.</li>
></ul>
></div>

## Task Submission
>Through the EDA & Modeling above, we can answer [several tasks](https://www.kaggle.com/atulanandjha/temperature-readings-iot-devices/tasks).

### How outside temp was related to inside temp ?
><div class="alert alert-info" role="alert">
>Answer : 
><ul>
><li>The outside temperature is a mixture of <b>multiple distributions</b>, while the inside temperature follows a single distribution.</li>
><li>The inside temperature has a flat trend, but <u>the outside temperature has a trend that seems to be driven by time-series factors such as seasonality.</u></li>
></ul>
></div>

In [None]:
dist = (hv.Distribution(df[df['place']=='In']['temp'], label='In') * hv.Distribution(df[df['place']=='Out']['temp'], label='Out'))\
                                .opts(title="Temperature by Place Distribution", xlabel="Temperature", ylabel="Density",tools=['hover'],show_grid=True, fontsize={'title':11})
tsdf['daily'] = tsdf['date'].dt.normalize()  # truncate each timestamp to midnight of its day
in_day = tsdf[tsdf['place']=='In'].groupby(['daily']).agg({'temp':['mean']})
in_day.columns = [f"{i[0]}_{i[1]}" for i in in_day.columns]
out_day = tsdf[tsdf['place']=='Out'].groupby(['daily']).agg({'temp':['mean']})
out_day.columns = [f"{i[0]}_{i[1]}" for i in out_day.columns]
curve = (hv.Curve(in_day, label='In') * hv.Curve(out_day, label='Out')).opts(title="Daily Temperature Mean", ylabel="Temperature", xlabel='Day', shared_axes=False,tools=['hover'],show_grid=True)
(dist + curve).opts(width=400, height=300)
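
>The plots compare the two series visually. A single number for how closely they move together is the Pearson correlation, sketched below on hypothetical daily means; the frame `toy` and its values are not from the dataset.

```python
import pandas as pd

# Hypothetical daily means, not values from the dataset.
toy = pd.DataFrame({
    'In': [30.0, 31.0, 32.0, 33.0],
    'Out': [34.0, 36.0, 38.0, 40.0],
})

# Pearson correlation between the two columns; 1.0 means a perfect linear relationship.
corr = toy['In'].corr(toy['Out'])
print(round(corr, 3))
```

>On the notebook's data, `inp_df['In'].corr(inp_df['Out'])` would give the corresponding figure for the interpolated daily means.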

### variance of temp for inside - outside room temp ?
><div class="alert alert-info" role="alert">
>Answer : 
><ul>
><li>As shown below, <u>the outside temperature has a larger variance</u> than the inside temperature.</li>
></ul>
></div>

In [None]:
in_var = hv.Violin(org_df['In'].values, vdims='Temperature').opts(title="In Temperature Variance", box_color='red')
out_var = hv.Violin(org_df['Out'].values, vdims='Temperature').opts(title="Out Temperature Variance", box_color='blue')
(in_var + out_var).opts(opts.Violin(width=400, height=300,show_grid=True))
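
>The violin plots show the spread graphically. The variance itself can be computed directly, as in this sketch on hypothetical values; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical daily means, not values from the dataset.
toy = pd.DataFrame({
    'In': [30.0, 31.0, 32.0],
    'Out': [30.0, 35.0, 40.0],
})

# DataFrame.var returns the sample variance (ddof=1) of each column.
variances = toy.var()
print(variances)
```

>Calling `org_df[['In', 'Out']].var()` would report the two variances for the data plotted above.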

### Predict the next scenario?
><div class="alert alert-info" role="alert">
>Answer : 
><ul>
><li>We built the forecasting model by using <b>Prophet</b>.</li>
><li>For the next 30 points (about a month), the forecasts look plausible on visual inspection, although their accuracy was not measured against held-out data.</li>
></ul>
></div>

In [None]:
run_prophet('In',30, False)
run_prophet('Out',30, False)
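
>To put a number on forecast accuracy, one option is to hold back the last days of the series, forecast them, and compare. The sketch below assumes a forecast frame with Prophet's `ds` and `yhat` columns and an observation frame with `ds` and `y`. The helper `holdout_mae` and the sample values are hypothetical, and `run_prophet` as written would first need to return its forecast.

```python
import pandas as pd

def holdout_mae(actual: pd.DataFrame, forecast: pd.DataFrame) -> float:
    """Mean absolute error between observed 'y' and predicted 'yhat', matched on 'ds'."""
    merged = actual.merge(forecast[['ds', 'yhat']], on='ds', how='inner')
    return float((merged['y'] - merged['yhat']).abs().mean())

# Hypothetical held-out observations and forecasts.
actual = pd.DataFrame({'ds': pd.to_datetime(['2018-12-01', '2018-12-02']),
                       'y': [30.0, 32.0]})
forecast = pd.DataFrame({'ds': pd.to_datetime(['2018-12-01', '2018-12-02']),
                         'yhat': [31.0, 30.0]})
print(holdout_mae(actual, forecast))
```

>The inner join on `ds` means only dates present in both frames are scored, so forecast rows beyond the held-out window are ignored.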

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 8. References
>* **Good Pre-processing & EDA notebook**  
>https://www.kaggle.com/satishkundanagar/temp-reading-iot-devices-eda
>* **Prophet Document**  
>https://facebook.github.io/prophet/docs/quick_start.html
>* **Prophet Introduction Paper**  
>https://peerj.com/preprints/3190.pdf

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>