Applying K-Means clustering to the Occupancy Detection dataset to see whether the two clusters it finds separate occupied from unoccupied observations.

Dataset: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+


Attribute Information:

Date time, year-month-day hour:minute:second
Temperature, in Celsius
Relative Humidity, %
Light, in Lux
CO2, in ppm
Humidity Ratio, derived quantity from temperature and relative humidity, in kg water vapor/kg air
Occupancy, 0 or 1, 0 for not occupied, 1 for occupied status

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale
%matplotlib inline

In [3]:
df = pd.read_csv('OccupancyDetectionTrain.csv',index_col=0)

In [4]:
df.head()

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
1,2015-02-04 17:51:00,23.18,27.272,426.0,721.25,0.004793,1
2,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
3,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1
4,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
5,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1


In [11]:
df.drop('date', axis=1, inplace=True)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8143 entries, 1 to 8143
Data columns (total 6 columns):
Temperature      8143 non-null float64
Humidity         8143 non-null float64
Light            8143 non-null float64
CO2              8143 non-null float64
HumidityRatio    8143 non-null float64
Occupancy        8143 non-null int64
dtypes: float64(5), int64(1)
memory usage: 445.3 KB


In [None]:
sns.set_style('whitegrid')
# Scatter of Light vs. CO2, colored by the true occupancy label
sns.lmplot(x='Light', y='CO2', data=df, hue='Occupancy',
           palette='coolwarm', height=6, aspect=1, fit_reg=False)

In [13]:
df.describe()

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
count,8143.0,8143.0,8143.0,8143.0,8143.0,8143.0
mean,20.619084,25.731507,119.519375,606.546243,0.003863,0.21233
std,1.016916,5.531211,194.755805,314.320877,0.000852,0.408982
min,19.0,16.745,0.0,412.75,0.002674,0.0
25%,19.7,20.2,0.0,439.0,0.003078,0.0
50%,20.39,26.2225,0.0,453.5,0.003801,0.0
75%,21.39,30.533333,256.375,638.833333,0.004352,0.0
max,23.18,39.1175,1546.333333,2028.5,0.006476,1.0


In [14]:
from sklearn.cluster import KMeans

In [15]:
kmmodel = KMeans(n_clusters=2)

In [16]:
kmmodel.fit(df.drop('Occupancy',axis=1))

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [17]:
kmmodel.cluster_centers_

array([[  2.02935411e+01,   2.50674363e+01,   3.34370110e+01,
          4.66834950e+02,   3.67512399e-03],
       [  2.17638091e+01,   2.80666210e+01,   4.22216094e+02,
          1.09782167e+03,   4.52141198e-03]])
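
The two centers above differ mostly in Light and CO2, the columns with the largest numeric ranges, so those columns dominate the Euclidean distances K-Means relies on. A minimal sketch of standardizing the features before clustering follows. It runs on a small synthetic frame, not the occupancy data; the values, the `toy` name, and `random_state=0` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the features (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Temperature': np.concatenate([rng.normal(20, 0.3, 50), rng.normal(22, 0.3, 50)]),
    'CO2': np.concatenate([rng.normal(450, 20, 50), rng.normal(1000, 50, 50)]),
})

# Rescale each column to zero mean and unit variance so no single
# column dominates the distance computation.
scaled = StandardScaler().fit_transform(toy)

scaled_model = KMeans(n_clusters=2, n_init=10, random_state=0)
scaled_labels = scaled_model.fit_predict(scaled)
```

Whether scaling improves agreement with the occupancy labels on the real data would need to be checked; it only removes the dependence on measurement units.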

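We now compare the cluster assignments with the actual occupancy values from the original dataset to evaluate how well the K-Means clusters match them. K-Means never sees the Occupancy column, and its cluster numbers are arbitrary, so the comparison is only meaningful when cluster 0 corresponds to "not occupied" and cluster 1 to "occupied", as it does in this run.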

In [19]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['Occupancy'],kmmodel.labels_))
print(classification_report(df['Occupancy'],kmmodel.labels_))

[[6159  255]
 [ 181 1548]]
             precision    recall  f1-score   support

          0       0.97      0.96      0.97      6414
          1       0.86      0.90      0.88      1729

avg / total       0.95      0.95      0.95      8143
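
Because K-Means numbers its clusters arbitrarily, a different run could return the same grouping with 0 and 1 swapped, which would make the report above look far worse than it is. One hedged way to guard against this is to relabel each cluster with the majority true label of its members before scoring. In the sketch below, `align_clusters` is a hypothetical helper and the arrays are toy values.

```python
import numpy as np

def align_clusters(y_true, cluster_ids):
    """Relabel each cluster with the most common true label among its members."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    aligned = np.empty_like(y_true)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        # Majority vote over the true labels that fall in this cluster.
        aligned[members] = np.bincount(y_true[members]).argmax()
    return aligned

# Toy example in which the cluster IDs are flipped relative to the labels.
y_true = np.array([0, 0, 0, 1, 1, 0])
cluster_ids = np.array([1, 1, 1, 0, 0, 0])
aligned = align_clusters(y_true, cluster_ids)
```

On the notebook's data the call would be `align_clusters(df['Occupancy'], kmmodel.labels_)`, with the result passed to `confusion_matrix` in place of `kmmodel.labels_`.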



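The two clusters match the occupancy labels closely: 7,707 of the 8,143 rows (about 94.6%) fall in the cluster corresponding to their true label, and the weighted-average precision, recall and F1-score are all 0.95. Precision on the occupied class is lower, at 0.86.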
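
The figures in the report can be recomputed directly from the confusion matrix printed above, which makes it easier to separate the overall agreement from the weaker precision on the occupied class. A short sketch using only those four counts:

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true Occupancy values, columns are cluster labels.
cm = np.array([[6159, 255],
               [181, 1548]])

accuracy = np.trace(cm) / cm.sum()          # rows on the diagonal / all rows
precision = np.diag(cm) / cm.sum(axis=0)    # per cluster (column totals)
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row totals)

print(f"accuracy:  {accuracy:.3f}")
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
```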