# Thermal Comfort Analysis & Prediction Model
BPS5229 Individual Assignment
by Liu Renhao (A0111048W)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Introduction

The generally accepted definition of thermal comfort according to ASHRAE is "the condition of the mind that expresses satisfaction with the thermal environment".

Thermal comfort is commonly attributed to the following 6 factors:

![](https://www.pae-engineers.com/system/uploads/fae/image/asset/786/1430x950_Human_Comfort.jpg)

However, these factors can be subjective and may not always give accurate results. 

The first part of this study analyzes the impact of variables affecting thermal comfort, using the dataset from ASHRAE Global Thermal Comfort Database II.

The second part of this study will be to determine which factors can help to create a better model for predicting thermal comfort. 


# Part 1 - Exploratory Analysis of ASHRAE Global Thermal Comfort Database II

**Loading of data from ASHRAE Global Thermal Comfort Database II:**

In [None]:
raw_data = pd.read_csv("../input/ashrae-global-thermal-comfort-database-ii/ashrae_db2.01.csv")
raw_data

**A summary on the type of information included in this dataset:**

In [None]:
raw_data.info()

For this study, we shall extract the following 10 feature variables which could have an impact on thermal comfort (target variable):

1) Season

2) Koppen climate classification

3) Building type

4) Cooling strategy building level

5) Clo *

6) Met *

7) Air temperature (C) * 

8) Relative humidity (%) *

9) Air velocity (m/s) *

10) Outdoor monthly air temperature

(Factors marked with * are commonly used in thermal comfort models. The sixth such factor, radiant temperature (C), was excluded from this study because its missing values would reduce the size of the usable dataset.)

In [None]:
data = raw_data[['Season', 'Koppen climate classification', 'Building type', 'Cooling startegy_building level',
                 'Clo', 'Met', 'Air temperature (C)', 'Relative humidity (%)','Air velocity (m/s)', 
                 'Outdoor monthly air temperature (C)', 'Thermal comfort']]
data

# Dealing with missing data

Checking for missing data in extracted dataset:

In [None]:
data.isnull().sum()

For simplicity, rows with missing data were removed from the dataset:

In [None]:
data = data.dropna()
data

In [None]:
data.isnull().sum()

No more missing data!
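
Before dropping rows wholesale, it can help to see how much each column contributes to the loss. The following is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the ASHRAE dataset); it reports the share of missing values per column and how many rows survive `dropna()`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the extracted dataset (values are made up).
toy = pd.DataFrame({
    'Clo': [0.5, np.nan, 0.7, 0.6],
    'Met': [1.0, 1.2, np.nan, 1.1],
    'Air temperature (C)': [24.0, 25.5, 23.0, np.nan],
    'Thermal comfort': ['5', '6', '4', '5'],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Rows before and after a blanket dropna().
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_pct.round(1))
print(f'Rows kept: {rows_after} of {rows_before}')
```

If a single column accounts for most of the dropped rows, excluding that column (as was done for radiant temperature) may preserve more data than dropping rows.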

Checking data type of columns:

In [None]:
data.dtypes

Converting 'object' data type to 'category':

In [None]:
# Keep non-object columns as they are and convert object columns to 'category'
data = pd.concat([data.select_dtypes(exclude=['object']),
                  data.select_dtypes(include=['object']).astype('category')], axis=1)
data.dtypes

A closer look at 'Thermal comfort', the target variable we are interested in for this study:

In [None]:
data['Thermal comfort'].value_counts()

In [None]:
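# Drop rows where the vote is recorded as the string 'Na'; .copy() avoids chained-assignment warnings
data = data[data['Thermal comfort'] != 'Na'].copy()

# Cast via str so that the unused 'Na' category is not part of the integer conversion
data['Thermal comfort'] = data['Thermal comfort'].astype(str).astype('int64')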
data['Thermal comfort'].value_counts(sort=False)
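
As an alternative to filtering on the literal string 'Na', the conversion can be sketched with `pd.to_numeric`, which turns any non-numeric entry into NaN. This is only an illustration on a made-up series, not the notebook's actual data:

```python
import pandas as pd

# Made-up votes; 'Na' marks a missing response.
raw = pd.Series(['5', '6', 'Na', '4', '6'], name='Thermal comfort')

# errors='coerce' converts anything that is not a number into NaN.
votes = pd.to_numeric(raw, errors='coerce')

# Drop the missing responses and store the rest as integers.
votes = votes.dropna().astype('int64')

print(votes.value_counts(sort=False))
```

This approach does not depend on how the missing marker is spelled, which may be useful if other placeholder strings appear in the column.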

In [None]:
data['Thermal comfort'].value_counts(sort=False).plot(kind='bar', figsize=(8,8))

**Observation: Most data points for thermal comfort are within the comfortable and very comfortable range (i.e. 5 & 6).**

In [None]:
data.describe()

In [None]:
data['Season'].value_counts().plot(kind='bar', figsize=(8,8))

**Observation: Most data points are in summer, followed by winter, then spring. The fewest data points are available for autumn.**

In [None]:
data['Koppen climate classification'].value_counts().plot(kind='bar', figsize=(8,8))

**Observation: The three most common climate types included in this dataset are 1) hot semi-arid, 2) humid subtropical, 3) tropical wet savanna.**

In [None]:
data['Building type'].value_counts().plot(kind='bar', figsize=(8,8))

**Observation: Most common data points are for office building type.**

In [None]:
data['Cooling startegy_building level'].value_counts().plot(kind='bar', figsize=(8,8))

**Observation: Building cooling strategy is relatively evenly distributed across the 3 types.**

# 1) Thermal comfort VS Air temperature (C)

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.boxplot(x = 'Thermal comfort', y = 'Air temperature (C)', hue = 'Cooling startegy_building level', data = data)

**Observations: In general, the lower the air temperature, the higher the thermal comfort.**
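
A trend read off a boxplot can be cross-checked numerically by comparing the median of the variable for each comfort vote. The sketch below uses a small made-up frame (not the ASHRAE data) purely to show the mechanics:

```python
import pandas as pd

# Made-up readings: each comfort vote paired with an air temperature.
toy = pd.DataFrame({
    'Thermal comfort': [2, 2, 4, 4, 6, 6],
    'Air temperature (C)': [30.0, 31.0, 27.0, 26.0, 23.0, 24.0],
})

# Median air temperature for each comfort vote.
medians = toy.groupby('Thermal comfort')['Air temperature (C)'].median()

print(medians)
```

On the real dataset, the same `groupby` could be applied to any of the numeric features to see whether the medians move consistently with the comfort vote.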

# 2) Thermal comfort VS Relative humidity (%)

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.boxplot(x = 'Thermal comfort', y = 'Relative humidity (%)', hue = 'Cooling startegy_building level', data = data)

**Observations: Difficult to determine if relative humidity on its own has significant impact on thermal comfort.**


# 3) Thermal comfort VS Air velocity (m/s)

Air velocities of 1.0 m/s or higher were removed as outliers in order to obtain a more representative dataset:


In [None]:
data = data[data['Air velocity (m/s)'] < 1.0]

fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.boxplot(x = 'Thermal comfort', y = 'Air velocity (m/s)', hue = 'Cooling startegy_building level', data = data)

**Observations: The lower the air velocity, the better the thermal comfort.**
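
When filtering by a fixed threshold, it is worth recording how many rows the filter removes. The helper below, `filter_below`, is a hypothetical name introduced for this sketch; it keeps rows strictly below a limit and prints the number of rows dropped. The data is made up:

```python
import pandas as pd

def filter_below(df, column, limit):
    """Keep rows where df[column] < limit and report how many were dropped."""
    kept = df[df[column] < limit].copy()  # copy so later assignments do not warn
    dropped = len(df) - len(kept)
    print(f'{column}: dropped {dropped} of {len(df)} rows (values >= {limit})')
    return kept

# Made-up air velocities, including two values at or above the 1.0 m/s limit.
toy = pd.DataFrame({'Air velocity (m/s)': [0.05, 0.10, 0.30, 1.00, 2.50]})
filtered = filter_below(toy, 'Air velocity (m/s)', 1.0)
```

The same helper could be reused for the metabolic rate threshold, assuming the same strict "less than" rule is intended.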


# 4) Thermal comfort VS Clothing insulation

Rounding of clothing insulation to 1 decimal place:

In [None]:
data['Clo'] = data['Clo'].round(1)
data['Clo'].value_counts()

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.boxplot(x = 'Thermal comfort', y = 'Clo', hue = 'Cooling startegy_building level', data = data)

Checking distribution of clothing insulation in dataset:

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10,10)
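# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(data['Clo'], kde=True, ax=ax)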

# 5) Thermal comfort VS Metabolic rate

Metabolic rates of 3.0 or higher were removed as outliers in order to obtain a more representative dataset:

In [None]:
data = data[data['Met'] < 3.0]

fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.boxplot(x = 'Thermal comfort', y = 'Met', hue = 'Cooling startegy_building level', data = data)

Checking distribution of metabolic rate in dataset:

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10,10)
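# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(data['Met'], kde=True, ax=ax)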

Final dataset to be used for creating a prediction model for thermal comfort:

In [None]:
data

# Part 2 - Creating a Prediction Model for Thermal Comfort

Thermal comfort has been selected as the prediction target. The dataset will be divided into the target variable (i.e. Thermal comfort) and the feature variables. Three prediction models for thermal comfort will be explored using the random forest classification algorithm.

# Prediction Model 1 (using 5 features)

The following 5 common features in ASHRAE thermal comfort model will be used for the first prediction model:

1) Air temperature (C)

2) Air velocity (m/s)

3) Relative humidity (%)

4) Clo

5) Met


In [None]:
y1 = data['Thermal comfort']

features1 = ['Air temperature (C)', 'Air velocity (m/s)', 'Relative humidity (%)', 'Clo', 'Met']

X1 = data[features1]

X1.describe()

Splitting dataset into training and testing set. We will be using 80% of dataset for training and 20% of dataset for testing:

In [None]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.2, random_state = 10)
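
Because the comfort votes are unevenly distributed, a plain random split may leave the rarer classes under-represented in the test set. One option, shown here as a sketch on made-up data rather than as the method used in this study, is the `stratify` argument of `train_test_split`, which preserves class proportions in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced target: 50% class 6, 30% class 5, 20% class 4.
X = pd.DataFrame({'Air temperature (C)': [22.0 + 0.1 * i for i in range(100)]})
y = pd.Series([6] * 50 + [5] * 30 + [4] * 20, name='Thermal comfort')

# stratify=y keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

print(y_test.value_counts().sort_index())
```

Stratification requires every class to have at least two samples, so very rare votes may need to be merged or removed first.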

Using random forest classification algorithm:

In [None]:
clf1 = RandomForestClassifier(n_estimators = 100)
clf1.fit(X1_train, y1_train)
pred1 = clf1.predict(X1_test)

Checking accuracy:

In [None]:
print('Accuracy score: %.1f' % (accuracy_score(y1_test, pred1)*100))

Reviewing the confusion matrix: rows are actual values and columns are predicted values, with classes in ascending order (1-6). The diagonal entries count the samples that were classified correctly.

In [None]:
confusion_matrix(y1_test, pred1)
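
The raw array is easier to read with labelled rows and columns. The sketch below uses made-up labels and predictions to show one way of wrapping the matrix in a DataFrame, and confirms that the diagonal sum divided by the total equals the accuracy score:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted votes for three classes.
y_true = [4, 5, 5, 6, 6, 6]
y_pred = [4, 5, 6, 6, 6, 5]
labels = [4, 5, 6]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f'actual {label}' for label in labels],
    columns=[f'predicted {label}' for label in labels],
)

# Correct predictions sit on the diagonal, so trace / total is the accuracy.
acc_from_cm = np.trace(cm) / cm.sum()

print(cm_df)
print(acc_from_cm, accuracy_score(y_true, y_pred))
```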

# Prediction Model 2 (using 3 features)

Question: Does reducing the number of feature variables help to improve the prediction model?

The following 3 features related to indoor environment conditions will be used for the second prediction model:

1) Air temperature (C)

2) Air velocity (m/s)

3) Relative humidity (%)


In [None]:
y2 = data['Thermal comfort']

features2 = ['Air temperature (C)', 'Air velocity (m/s)', 'Relative humidity (%)']

X2 = data[features2]

X2.describe()

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.2, random_state = 10)

In [None]:
clf2 = RandomForestClassifier(n_estimators = 100)
clf2.fit(X2_train, y2_train)
pred2 = clf2.predict(X2_test)

In [None]:
print('Accuracy score : %.1f'% (accuracy_score(y2_test, pred2)*100))

**Observation: There is a slight decrease in accuracy when only 3 feature variables are used as compared to 5 feature variables in the first model.**

In [None]:
confusion_matrix(y2_test, pred2)

# Prediction Model 3 (using 10 features)

Including all 10 feature variables in the third prediction model to see if the accuracy score can be improved:

1) Air temperature (C)

2) Air velocity (m/s)

3) Relative humidity (%)

4) Clo

5) Met 

6) Season 

7) Koppen climate classification 

8) Building type

9) Cooling startegy_building level 

10) Outdoor monthly air temperature (C)

In [None]:
y3 = data['Thermal comfort']

features3 = ['Air temperature (C)', 'Air velocity (m/s)', 'Relative humidity (%)', 'Clo', 'Met', 
            'Season', 'Koppen climate classification', 'Building type', 'Cooling startegy_building level', 'Outdoor monthly air temperature (C)']

X3 = data[features3]

X3.describe()

Using one-hot encoding to convert the categorical feature variables into numeric columns that the classifier can use:

In [None]:
X3 = pd.get_dummies(data = X3, drop_first = True)
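
To illustrate what `pd.get_dummies` with `drop_first=True` produces, here is a small sketch on a made-up frame with a single categorical column. The first category in sorted order is dropped and becomes the implicit reference level:

```python
import pandas as pd

# Made-up frame with one numeric and one categorical feature.
toy = pd.DataFrame({
    'Air temperature (C)': [24.0, 26.5, 23.0],
    'Season': pd.Categorical(['Summer', 'Winter', 'Spring']),
})

# Numeric columns pass through; each remaining category gets its own indicator column.
encoded = pd.get_dummies(data=toy, drop_first=True)

print(encoded.columns.tolist())
# ['Air temperature (C)', 'Season_Summer', 'Season_Winter']
```

Here 'Spring' is the dropped level, so a row with both indicators off represents spring.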

In [None]:
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size = 0.2, random_state = 10)

In [None]:
clf3 = RandomForestClassifier(n_estimators = 100)
clf3.fit(X3_train, y3_train)
pred3 = clf3.predict(X3_test)

In [None]:
print('Accuracy score : %.1f '% (accuracy_score(y3_test, pred3)*100))

In [None]:
confusion_matrix(y3_test, pred3)

**Observation: There is an improvement in accuracy of thermal comfort prediction when more feature variables are included.**

Prediction model 3 has the best results (i.e. highest accuracy) when using random forest classification algorithm. 
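
To see which feature variables a random forest relies on most, the fitted model's `feature_importances_` attribute can be inspected. The sketch below uses synthetic data with hypothetical column names ('informative' and 'noise') and is not a result from this study:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target depends only on the 'informative' column.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y = (X['informative'] > 0).astype(int)

# random_state makes the fitted forest reproducible.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was used more.
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

For one-hot encoded features, the importances of the individual indicator columns would need to be summed to get a figure for the original categorical variable.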


# Conclusion

Prediction model 3, comprising all 10 feature variables, has the best results (i.e. highest accuracy) when using the random forest classification algorithm.

However, there are limitations to using this prediction model:
* Dataset used for training is more skewed towards hot semi-arid climate 
* Dataset used for training is more skewed towards summer season
* Dataset used for training is more skewed towards office building type
* Approximately 55% accuracy result may not be good enough to predict thermal comfort with high confidence
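
An accuracy figure is easier to judge against a trivial baseline. With an imbalanced target, always predicting the most frequent class can already score well. The following sketch uses made-up class counts (not the study's data) and scikit-learn's `DummyClassifier` to compute such a baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up, imbalanced votes: class 6 is the most frequent.
y_train = np.array([6] * 50 + [5] * 30 + [4] * 20)
y_test = np.array([6] * 10 + [5] * 6 + [4] * 4)

# The baseline ignores the features, so placeholder columns are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print(f'Baseline accuracy: {baseline_acc * 100:.1f}')
```

Comparing each model's accuracy against this kind of baseline on the real test split would show how much the feature variables actually contribute.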

**Recommendations:**
* Use a more diverse and evenly distributed dataset for training (e.g. include more climate types and more building types)
* Include radiant temperature as one of the feature variables, since it is one of the more widely recognised factors affecting thermal comfort



# End