Problem question:
Can the CO2 concentration, room air humidity, room temperature, and luminosity data be used to identify whether a room has occupants or not? 

The passive infrared (PIR) sensor measures the occupancy in a room (target label). 

Task:   
1-Data visualization (heat maps, plotted confusion matrix, and decision trees),   
2-Data cleaning (handling NaNs, datatypes, and labels),   
3-Preprocessing (handling classification data, adapting a column to boolean),   
4-Feature engineering (defining a boolean target from the PIR readings),   
5-Model building (preparing the decision tree algorithm),   
6-Model training (splitting the data and running the algorithm),   
7-Evaluation (F1 score, accuracy, precision, recall, and confusion matrix).   

Result:   
Using the algorithm below, it seems that CO2 concentration, room air humidity, room temperature, and luminosity could help predict if a room is occupied or not.

In [None]:
# This Python 3 environment comes with analytics libraries installed
# as defined by the kaggle/python Docker image

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Text file content:   


In [None]:
#Readme text file describes the dataset
#use a context manager so the file is closed after reading
with open("../input/smart-building-system/KETI/README.txt", "r") as readme:
    print(readme.read())

Choosing a room to explore the data and create a model:   

In [None]:
#data for a single room: 656
df656light=pd.read_csv('../input/smart-building-system/KETI/656A/light.csv')
df656temp=pd.read_csv('../input/smart-building-system/KETI/656A/temperature.csv')
df656co2=pd.read_csv('../input/smart-building-system/KETI/656A/co2.csv')
df656pir=pd.read_csv('../input/smart-building-system/KETI/656A/pir.csv')
df656hum=pd.read_csv('../input/smart-building-system/KETI/656A/humidity.csv')

The data uses Unix Epoch Time (UET) instead of datetime. For simplicity, the time column is labeled with its UET value. I convert the time values to strings so the dataframes are easier to join; this does not affect the results. Then I merge the dataframes. Each dataframe holds one indicator (light, CO2, humidity, temperature, or PIR).
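
The column labels used below, such as `'1377299095'` and `' 177.00'`, suggest that the CSV files carry no header row, so `pd.read_csv` promotes the first reading to column names. A minimal sketch of an alternative, assuming each file holds two unnamed columns (epoch seconds, then the value). The names `time` and `light` and the in-memory sample are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a headerless sensor file.
sample_csv = io.StringIO("1377299095, 177.00\n1377299100, 178.00\n")

# header=None keeps the first reading as data; names= supplies the labels.
light = pd.read_csv(sample_csv, header=None, names=["time", "light"],
                    skipinitialspace=True)

# Convert Unix epoch seconds to timezone-aware datetimes (UTC).
light["time"] = pd.to_datetime(light["time"], unit="s", utc=True)

print(light)
```

With named columns, the later rename steps would not be needed, and the first reading of each file stays in the data.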

In [None]:
#change Unix Epoch Time to string (to use as label)
#1377299095 UET is Friday, August 23, 2013 11:04:55 PM GMT

df656light['1377299095']=df656light['1377299095'].astype(str)
df656temp['1377299095']=df656temp['1377299095'].astype(str)
df656hum['1377299095']=df656hum['1377299095'].astype(str)

In [None]:
#merge light, temperature, and humidity dfs
df656lt = pd.merge(df656light, df656temp, on='1377299095')
df656lth = pd.merge(df656lt, df656hum, on='1377299095')

In [None]:
#rename columns
df656lth.rename(columns = {"1377299095": "Fri, Aug 23, 2013 11:04:55 PM GMT",
                          " 177.00": "lights 177.00", " 24.37": 'temp 24.37',
                          " 49.90": "humidity 49.90"},  
           inplace = True) 
df656lth

The co2 and pir datasets for **Room 656** have a different number of rows than the temperature, light, and humidity dataframes. To address this, I combine them and fill the resulting NaNs with 0.
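
Note that `combine_first` aligns rows on the index, not on the timestamp column, so readings taken at different times can end up in the same row. A hedged alternative is an outer merge keyed on the timestamp. The frames and the column name `time` below are hypothetical:

```python
import pandas as pd

# Hypothetical readings: CO2 is logged less often than light.
light = pd.DataFrame({"time": [100, 105, 110], "light": [177.0, 178.0, 176.0]})
co2 = pd.DataFrame({"time": [100, 110], "co2": [578.0, 580.0]})

# how="outer" keeps every timestamp from both frames; missing readings become NaN.
merged = pd.merge(light, co2, on="time", how="outer").sort_values("time")

# Filling with 0 follows the approach in this notebook; forward-filling
# (merged.ffill()) is another option when a reading of 0 is not plausible.
filled = merged.fillna(0)
print(filled)
```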

In [None]:
#uneven from other dfs
#first, convert UET to string (as labels)
#then rename columns
df656co2['1377299095']=df656co2['1377299095'].astype(str)
df656co2.rename(columns = {"1377299095": "Fri, Aug 23, 2013 11:04:55 PM GMT",
                          " 578.00": "co2 578.00"},  
           inplace = True) 
df656co2

In [None]:
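#combine uneven dataframes and fill NaNs with 0
#note: combine_first aligns rows by index, not by timestamp
df656lthco2 = df656lth.combine_first(df656co2)
#assign the result; fillna returns a new dataframe
df656lthco2 = df656lthco2.fillna(0)
df656lthco2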

In [None]:
##this is the target data
##passive infrared (PIR) sensor measures the occupancy in a room
#uneven to other dfs
df656pir['1377299096']=df656pir['1377299096'].astype(str)

In [None]:
df656pir.describe()

In [None]:
df656pir.rename(columns = {"1377299096": "Fri, Aug 23, 2013 11:04:55 PM GMT",
                          " 27.00": "PIR 27.00"},  
           inplace = True) 
df656pir

In [None]:
#combine uneven dataframes and fill NaNs with 0
df656all = df656lthco2.combine_first(df656pir)
df656all=df656all.fillna(0)

In [None]:
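#1377299096 is Friday, August 23, 2013 11:04:56 PM GMT
#note: these columns were already renamed on df656pir above,
#so this rename leaves df656all unchanged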
df656all.rename(columns = {"1377299096": "Fri, Aug 23, 2013 11:04:56 PM GMT",
                          " 27.00": "PIR 27.00"},  
           inplace = True) 
df656all

In [None]:
df656all.describe()

In [None]:
df656all.corr()

In [None]:
#Approximately 6% of the PIR data is non-zero, indicating an occupied status of the room. 
#The remaining 94% of the PIR data is zero, indicating an empty room.

In [None]:
#Target column
#create a new column stating if the room is occupied based on PIR
df656all['Occupied_Room'] = np.where(df656all['PIR 27.00']!= 0, True, False)
df656all

In [None]:
df656all.describe(include='all')

In [None]:
correlation=df656all.corr()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.heatmap(correlation, cmap="Reds")

In [None]:
#machine learning classification model: Decision Tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [None]:
# Data slicing: split the dataset into training and testing sets
# using sklearn's train_test_split
# but, first, separate the target column: Occupied_Room
# X holds the 'testdf' attributes (the timestamp column is included as a feature)
# and Y holds the target variable

testdf=df656all[['Fri, Aug 23, 2013 11:04:55 PM GMT', 'co2 578.00',
       'humidity 49.90', 'lights 177.00', 'temp 24.37']]

#all values in those dfs
X = testdf.values[:,:]
Y = df656all['Occupied_Room'].values

In [None]:
# split the dataset for training and testing 
# random_state seeds the shuffle so the split is reproducible; any fixed integer works

X_train, X_test, y_train, y_test = train_test_split( 
          X, Y, test_size = 0.3, random_state = 1)
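
Because only about 6% of the rows are labeled occupied, a plain random split can leave the training and testing sets with slightly different class shares. A sketch, on synthetic data, of passing `stratify` to `train_test_split` so both sets keep the same share. The names `X_demo` and `y_demo` are illustrative and not part of this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 4 features, 6% positive labels.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.zeros(1000, dtype=int)
y_demo[:60] = 1

# stratify=y_demo preserves the 6% / 94% ratio in both splits.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

print(ytr.mean(), yte.mean())
```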

In [None]:
# train a decision-tree algorithm to make predictions 

classifier = DecisionTreeClassifier()
arbol=classifier.fit(X_train, y_train)
arbol

In [None]:
#make predictions
y_pred = classifier.predict(X_test)
y_pred

Confusion matrix
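
For a binary target, scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so the layout is `[[TN, FP], [FN, TP]]`. A small sketch with made-up labels showing how precision, recall, F1, and accuracy follow from those four counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (1 = occupied, 0 = empty).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted-occupied rows that are correct
recall = tp / (tp + fn)      # share of truly occupied rows that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp, precision, recall, f1, accuracy)
```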

In [None]:
#check for the accuracy of the algorithm (model)
print(confusion_matrix(y_test, y_pred))

In [None]:
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='Greens')

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
accuracy=metrics.accuracy_score(y_test, y_pred)
accuracy
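
With roughly 94% of the PIR readings equal to zero, a model that always predicts an empty room would already reach about 94% accuracy, so accuracy is best read next to a baseline. A sketch on synthetic labels using scikit-learn's `DummyClassifier`. The arrays below are made up and only mirror the imbalance noted earlier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with a 6% / 94% imbalance (1 = occupied).
y_demo = np.zeros(1000, dtype=int)
y_demo[:60] = 1
X_demo = np.zeros((1000, 1))  # features are ignored by the dummy model

# Always predict the most frequent class (empty room).
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_accuracy = accuracy_score(y_demo, y_base)
base_f1 = f1_score(y_demo, y_base, zero_division=0)
print(base_accuracy, base_f1)
```

A decision tree is only informative here if its recall and F1 for the occupied class clearly exceed this baseline.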

In [None]:
from sklearn import tree
tree.plot_tree(arbol)

In [None]:
#plotting the figure only
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(arbol,                   
                   filled=True)


Compare algorithm performance by running the same model on a different room:

In [None]:
##re-run the algorithm with another room's data (Room 421)
df421light=pd.read_csv('../input/smart-building-system/KETI/421/light.csv')
df421temp=pd.read_csv('../input/smart-building-system/KETI/421/temperature.csv')
df421co2=pd.read_csv('../input/smart-building-system/KETI/421/co2.csv')
df421pir=pd.read_csv('../input/smart-building-system/KETI/421/pir.csv')
df421hum=pd.read_csv('../input/smart-building-system/KETI/421/humidity.csv')

The column labels for room 421 differ from those of room 656A. The number of rows also differs. Only two dataframes have the same dimensions, whereas room 656A had three. The timestamps also differ among the dataframes for this room.

In [None]:
df421pir.describe()

In [None]:
df421pir[' 0.00'].unique()

In [None]:
#change Unix Epoch Time to string (to use as label)

df421light['1377299111']=df421light['1377299111'].astype(str)
df421temp['1377299111']=df421temp['1377299111'].astype(str)
df421hum['1377299111']=df421hum['1377299111'].astype(str)

In [None]:
#merge light and humidity dfs
df421lh = pd.merge(df421light, df421hum, on='1377299111')

In [None]:
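#merge temperature with light/humidity on the shared timestamp column
#(inner join, so only matching timestamps are kept), then fill any NaNs with 0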
df421lht = df421temp.merge(df421lh)
df421lht=df421lht.fillna(0)

1377299111 UET is Fri, Aug 23, 2013 11:05:11 PM GMT.   
For CO2 in room 421, the first timestamp is 1377299119,   
or Fri, Aug 23, 2013 11:05:19 PM GMT.   
For PIR, it is 1377299123, or Fri, Aug 23, 2013 11:05:23 PM GMT.   
The timestamps are only seconds apart.
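
Since the sensors in this room log a few seconds apart, exact timestamp matches may be rare. One possible way to pair readings is `pd.merge_asof` with a tolerance. This is a sketch on made-up values; the 10-second tolerance and the column name `time` are assumptions, not part of the notebook's method:

```python
import pandas as pd

# Made-up readings; timestamps are Unix epoch seconds and must be sorted.
temp = pd.DataFrame({"time": [1377299111, 1377299123], "temp": [22.84, 22.85]})
co2 = pd.DataFrame({"time": [1377299119, 1377299124], "co2": [373.0, 375.0]})

# For each temperature row, take the nearest CO2 reading within 10 seconds.
paired = pd.merge_asof(temp, co2, on="time",
                       direction="nearest", tolerance=10)
print(paired)
```

Rows with no reading inside the tolerance get NaN, which can then be filled or dropped as appropriate.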


In [None]:
df421co2['1377299119']=df421co2['1377299119'].astype(str)
df421pir['1377299123']=df421pir['1377299123'].astype(str)

In [None]:
#updated label of dfs to match
df421co2.rename(columns = {"1377299119": "Fri Aug 23 2013 11:05 PM GMT",
                          " 373.00": "CO2 373.00"},  
           inplace = True) 
df421lht.rename(columns = {"1377299111": "Fri Aug 23 2013 11:05 PM GMT",
                          " 22.84": "temp 22.84",
                          " 52.87": "hum 52.87",
                          " 195.00": "light 195.00"},  
           inplace = True)

In [None]:
df421lhtco2 = pd.merge(df421lht, df421co2, on='Fri Aug 23 2013 11:05 PM GMT')
df421lhtco2=df421lhtco2.fillna(0)
df421lhtco2

In [None]:
df421pir.rename(columns = {" 0.00": "PIR 0.00"},  
           inplace = True) 
df421pir

In [None]:
df421all= df421lhtco2.combine_first(df421pir)
df421all=df421all.fillna(0)

In [None]:
df421all.describe(include='all')

In [None]:
correlation2=df421all.corr()
correlation2

In [None]:
sns.heatmap(correlation2, cmap="Blues")

In [None]:
#Target column
#create a new column stating if the room is occupied based on PIR
df421all['Occupied_Room'] = np.where(df421all['PIR 0.00']!= 0, True, False)
df421all

In [None]:
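#select the feature columns
#'PIR 0.00' is left out: Occupied_Room is derived from it, so keeping it
#as a feature would leak the target into the model
testdf2=df421all[['1377299123', 'CO2 373.00', 'Fri Aug 23 2013 11:05 PM GMT',
       'hum 52.87', 'light 195.00', 'temp 22.84']]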

#all values in the df
X = testdf2.values[:,:]
Y = df421all['Occupied_Room'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split( 
          X, Y, test_size = 0.3, random_state = 1)

In [None]:
# train another decision-tree algorithm to make predictions 

classifier2 = DecisionTreeClassifier()
arbol2=classifier2.fit(X_train, y_train)
arbol2

In [None]:
#make predictions
y_pred = classifier2.predict(X_test)
y_pred

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='magma')

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
accuracy2=metrics.accuracy_score(y_test, y_pred)
accuracy2

In [None]:
tree.plot_tree(arbol2)

In [None]:
#visualization of tree only
fig = plt.figure(figsize=(14,12))
_ = tree.plot_tree(arbol2,                   
                   filled=True)