# How do human and personal factors influence thermal comfort perception?
*Do age, sex, metabolism, and other human factors influence how you feel comfort?*



# The Task - As Set
The following task has been requested:

> **Task Details**
> There is a lot of recent discussion on whether thermal comfort is really taking into consideration the needs of all occupants. There have even been popular news outlets who have covered stories on sexism in office temperatures from the New York Times, the New Yorker, and Time Magazine. The question related to this task is to piece together evidence from the ASHRAE Thermal Comfort Database II to show how factors such as sex, age, metabolism, and other factors influence comfort.
> 
> **Expected Submission**
> Users who want to tackle this task should use the Kaggle Kernels to piece together data and evidence gathered from the data set that provide a foundation for their analysis. The analysis should include a lot of documentation to provide clarity of the techniques and approach used.
> 
> **Relevant Literature**
> A great overview of recent work to analyze these differences can be found in this recent publication: https://www.researchgate.net/publication/324809583_Individual_Difference_in_Thermal_Comfort_A_Literature_Review

# The Data - As Provided
Alongside the data, the following information has been provided:

> Dataset Source: https://datadryad.org/stash/dataset/doi:10.6078/D1F671
> 
> **Abstract**
> Recognizing the value of open-source research databases in advancing the art and science of
> HVAC, in 2014 the ASHRAE Global Thermal Comfort Database II project was launched under
> the leadership of University of California at Berkeley’s Center for the Built Environment and
> The University of Sydney’s Indoor Environmental Quality (IEQ) Laboratory.
> The exercise began with a systematic collection and harmonization of raw data from the last
> two decades of thermal comfort field studies around the world. The final database is comprised
> of field studies conducted between 1995 and 2015 from around the world, with contributors
> releasing their raw data to the project for wider dissemination to the thermal comfort research
> community. After the quality-assurance process, there was a total of 81 846 rows of data of
> paired subjective comfort votes and objective instrumental measurements of thermal comfort
> parameters. An additional 25 617 rows of data from the original ASHRAE RP-884 database are
> included, bringing the total number of entries to 107 463.
> 
> The database is intended to support diverse inquiries about thermal comfort in field settings.
> To achieve this goal, two web-based tools were developed to accompany the database:
> 
> * Interactive visualization tool: provides a user-friendly interface for researchers and
> practitioners to explore and navigate their way around the large volume of data in ASHRAE
> Global Thermal Comfort Database II
> * Query builder tool: allows users to filter the database according to a set of selection criteria,
> and then download the results of that query in a generic comma-separated-values (.csv) file
> 
> **Methods**
> In order to ensure that the quality of the database would permit end-users to conduct robust
> hypothesis testing, the team built the data collection methodology on specific requirements, as
> follows:
> 
> * Data needed to come from field experiments rather than climate chamber research, so
> that it represented research conducted in “real” buildings occupied by “real” people doing
> their normal day-to-day activities, rather than paid college students sitting in a controlled
> indoor environment of a climate chamber.
> * Both instrumental (indoor climatic) and subjective (questionnaire) data were required,
> such that they were recorded in the same space at the same time.
> * The database needed to be built up from the raw data files generated by the original
> researchers, instead of their processed or published findings.
> * The raw data needed to come with a supporting codebook explaining the coding
> conventions used by the data contributor, to allow harmonization with the standardized
> data formatting within the database.
> * Data must have been published either in a peer-reviewed journal or conference paper.
> 
> Földváry Ličina, Veronika et al. (2018), ASHRAE Global … doi:10.6078/D1F671
> 
> All datasets from individual studies were subject to a stringent quality assurance process
> before being assimilated into the database. The research team conducted a final
> validation by first comparing each raw dataset with its related publication provided by the data
> contributor to prevent transmission errors. Systematic quality control of each study was
> performed to ensure that records within the database were reasonable. Firstly, distributions of
> each variable were visualized to identify aberrant values. Then, cross-plots between two
> variables (e.g. thermal sensation and thermal comfort) were used to check for incorrectly coded
> data. Finally, a few rows from each study were randomly selected to verify consistency between
> the original dataset and the standardized database. Since the data came from multiple
> independent studies, every record did not necessarily include all of the thermal comfort
> variables. Where data were missing, that particular range of cells was filled with a null value.
> 
> **Usage Notes**
> The dataset is provided as a comma-separated value (.csv) file using UTF-8 character encoding.
> The first row contains human-readable column headers. Each row represents an individual’s
> questionnaire responses, and the associated instrumental measurements, thermal index values
> and outdoor meteorological observations where available. Full details can be found in the
> related work.
> 
> **Funding**
> American Society of Heating, Refrigerating and Air Conditioning Engineers, Award: URP 1656
> 
> **References**
> This dataset is a supplement to https://doi.org/10.1016/j.buildenv.2018.06.022

# Initial Observations

At first glance, this dataset appears to be extremely rich in observations. I note, for example, that temperatures are recorded not only at 0.1 m, but also at 0.6 m and 1.1 m above the floor. Similarly, measurements are provided in both metric and imperial units.

As such, I will default to using only metric (degrees C) measurements. Similarly, I will default to using the SET (standard effective temperature) unless a true temperature observation is required. In such a case, observations taken at 1.1 m above the floor will be used, since this is roughly the height of a person's core.

# Initial Strategy

Since our objective is to investigate the parameters that influence thermal comfort perception, the following variables will be dependent variables:

* Thermal Sensation
* Thermal Acceptability
* Thermal Preference
* Thermal Comfort

We should seek to examine how these variables are affected by the temperature (as defined by SET), and in turn, how this relationship is affected by other (independent) variables.

Although I will perform a matrix analysis later in this investigation, "common-sense" independent variables to investigate include:

* Climate Classification (do people from different climates require different temperatures)
* Outdoor temperature (does outside temperature affect comfort)
* Building type (do people need different temperatures for different activities)
* Country
* Age
* Sex
* Weight (do people of different weights require different conditions, and does weight vary by region)
* Height (do people of different heights require different conditions, and does height vary by region)
* Metabolic Activity (How activity affects people - do some compensate better than others)
* Humidity (do some people tolerate humidity better?)

In order to examine these variables, we will need to control for confounding variables. For example, when examining Sex or Age, we will need to maintain a constant proportion of, say, building types and of subjects from each city or country across the groups being compared.

The most obvious strategy for doing this would be to use a combination of normalisation, binning and weighting techniques to create a standard, representative "basket" of observations.
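
As a rough illustration of the weighting idea, the sketch below reweights a tiny made-up sample so that each building type carries the same total weight within each sex before mean sensation votes are compared. The data, the equal-share target, and the variable names (`toy`, `cell_size`, `n_types`) are assumptions chosen purely for illustration; this is one possible approach, not necessarily the one used later in this analysis.

```python
import pandas as pd

# Made-up toy data: NOT from the ASHRAE database.
toy = pd.DataFrame({
    "Sex": ["Female"] * 4 + ["Male"] * 4,
    "Building type": ["Office", "Office", "Office", "Classroom",
                      "Office", "Classroom", "Classroom", "Classroom"],
    "Thermal sensation": [-1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0],
})

# Number of rows in each (Sex, Building type) cell, aligned to every row.
cell_size = toy.groupby(["Sex", "Building type"])["Thermal sensation"].transform("size")
# Number of distinct building types observed for each sex.
n_types = toy.groupby("Sex")["Building type"].transform("nunique")

# Each building type gets an equal share of the total weight within a sex,
# so the weights for each sex sum to 1.
toy["weight"] = 1.0 / (cell_size * n_types)

# Unweighted means reflect the different building mix of each group.
raw_means = toy.groupby("Sex")["Thermal sensation"].mean()
# Weighted means compare the groups on an equal building mix.
weighted = (toy["Thermal sensation"] * toy["weight"]).groupby(toy["Sex"]).sum()

print(raw_means)
print(weighted)
```

In this toy example the unweighted means differ between the sexes only because the two groups were sampled from different building mixes; once the mix is equalised, the difference disappears.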

On a personal note, I am a data analyst with a physics background, so my domain knowledge is somewhat limited. Although I will endeavour to do background reading, I will focus my efforts on what information can be derived from statistics, rather than on modelling of biological systems etc.

# Data Import

In [None]:
import numpy as np # Linear Algebra
import pandas as pd # Data Processing , CSV file I/O

#List the files in the input dataset

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Let's read the CSV blindly - just to see how it handles it.

df = pd.read_csv('../input/ashrae-global-thermal-comfort-database-ii/ashrae_db2.01.csv')

df.head(1)

In [None]:
#Now let's try it, just selecting columns of interest, and specifying the dtype.

cols = [
'Publication (Citation)','Year','Season','Koppen climate classification','Climate','City','Country',
'Building type','Age','Sex','Thermal sensation','Thermal sensation acceptability','Thermal preference',
'Thermal comfort','SET','Met','Air temperature (C)',
'Ta_h (C)','Relative humidity (%)','Humidity preference','Humidity sensation','Subject«s height (cm)',
'Subject«s weight (kg)','Outdoor monthly air temperature (C)'
]

Dtypes = {
'Publication (Citation)':'string','Season':'category','Koppen climate classification':'category',
'Climate':'category','City':'category','Country':'category','Building type':'category',
'Sex':'category','Thermal preference':'category'
}

df = pd.read_csv('../input/ashrae-global-thermal-comfort-database-ii/ashrae_db2.01.csv',usecols=cols, dtype = Dtypes, low_memory=False, na_values=['Nan','Na','NaN','N/A','na'])

df.head(1)

In [None]:
#Let's check that it works as we expected.

print(df.shape)
print(df.dtypes)
df.info()  # info() prints a per-column summary, including non-null counts

As you can see, there's a lot of NaN data in our dataset - something we'll need to take into account when performing our analysis.

We have some 107,585 entries - so it's quite a large and impressive dataset!
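
To put a number on "a lot of NaN data", the share of missing values per column can be computed directly. The following is a minimal sketch on a small made-up frame (the name `toy` and its values are placeholders, not real records); the same two calls should work unchanged on `df`.

```python
import numpy as np
import pandas as pd

# Made-up toy frame standing in for the real DataFrame.
toy = pd.DataFrame({
    "SET": [25.0, np.nan, np.nan, 24.0],
    "Age": [30.0, 41.0, np.nan, 28.0],
    "Sex": ["Male", "Female", "Female", "Male"],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```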

In [None]:
#How many unique publications?

print('There are ' + str(df['Publication (Citation)'].nunique()) + ' unique publications.')

#How many unique cities?

print('There are ' + str(df['City'].nunique()) + ' unique cities.')

#What range of years?

print('The range of years is from ' + str(df['Year'].min()) + ' to ' + str(df['Year'].max()) + '.')

It's a good thing that we have so many entries - we'll probably need to subset quite significantly - otherwise we might accidentally capture trends in time or city.

Given how many publications there are, we'll also need to see how consistent they are - or if they have any biases.

# Basic Stats

Let's start by just looking at the basics.

In [None]:
#Let's examine some basic stats

df.describe()


Whilst not particularly useful, this does provide some helpful data for our ranges. For example, the interquartile range of SET runs from 23.71 C to 27.63 C, whilst that of Air Temperature runs from 22.3 C to 26.4 C. Similarly, the interquartile range of humidity is from 35.3 % to 59.4 %, and that of outdoor air temperature from 10 C to 25 C.

These do provide something of a rough framework for where we could expect our numbers to be. They would suggest that the majority of our measurements are taken in temperate to tropical regions, though with a few extreme values (e.g. an outdoor temperature of -18.4 C).

It's also worth noting the mean age of 32, suggesting that our data skews towards young-mid working age subjects.
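
The quartiles quoted above are read from the `25%` and `75%` rows of `df.describe()`. If only the interquartile range of a single column is needed, it can be computed directly, as in this minimal sketch on made-up numbers (the `toy` frame is hypothetical and not drawn from the database):

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"SET": [20.0, 22.0, 24.0, 26.0, 28.0]})

# Lower and upper quartiles; quantile() skips NaN values by default.
q1, q3 = toy["SET"].quantile([0.25, 0.75])
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```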

In [None]:
#Count the number of examples of each category.
count_climates=df.Climate.value_counts(sort=False)
count_cities=df.City.value_counts(sort=False)
count_year=df.Year.value_counts(sort=False)

In [None]:
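import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 20))

# Top panel: number of records per climate category
plt.subplot(2, 1, 1)
count_climates.plot(kind='bar')

# Bottom panel: number of records per year, in chronological order
plt.subplot(2, 1, 2)
count_year.sort_index().plot(kind='bar')

plt.show()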

From this, we can conclude that the years are fairly spread out, with rough peaks in the mid 1990s and early 2010s.

However, the climate data is perhaps not particularly useful as it stands. There are a lot of categories, which will probably need simplifying. We should therefore either work from the numerical temperature and humidity data, or else simplify the categories, perhaps to warm, temperate and cold climates.
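
One possible simplification, sketched below, is to collapse each Köppen code to its main group using the first letter of the code (A tropical, B arid, C temperate, D continental, E polar). This is a slightly different grouping from the warm/temperate/cold split suggested above, and it assumes that the `Koppen climate classification` column holds standard codes such as `Cfa`; the helper `koppen_group` and the sample codes are hypothetical.

```python
import pandas as pd

# Main Köppen groups keyed by the first letter of the code.
KOPPEN_GROUPS = {
    "A": "Tropical",
    "B": "Arid",
    "C": "Temperate",
    "D": "Continental",
    "E": "Polar",
}

def koppen_group(code):
    """Map a Köppen code such as 'Cfa' to its main group; missing or unknown codes give None."""
    if not isinstance(code, str) or not code:
        return None
    return KOPPEN_GROUPS.get(code[0].upper())

# Sample codes for illustration only (assumed format, not read from the database).
toy = pd.Series(["Cfa", "Af", "BSh", "Dfb", None], name="Koppen climate classification")
simplified = toy.map(koppen_group)
print(simplified)
```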

In [None]:
plt.figure(figsize=(16, 20))
count_cities.plot(kind='bar')

This suggests that the available data is quite unevenly spread between cities. The column also includes non-city entries, e.g. "Midlands" and "Hampshire".

As you might expect, it also appears to be somewhat biased towards wealthier nations, with a large share of the records coming from the UK & USA. This will need to be accounted for in our analysis if we want our conclusions to represent humans as a whole, rather than just wealthier nations.
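
One simple way to limit the influence of heavily sampled countries is to cap the number of records taken from each one. The sketch below does this on a made-up frame; the cap `MAX_PER_COUNTRY`, the country counts, and the votes are all hypothetical, and capping is only one option alongside the weighting approach mentioned earlier.

```python
import pandas as pd

# Made-up toy data: NOT from the ASHRAE database.
toy = pd.DataFrame({
    "Country": ["UK"] * 6 + ["USA"] * 5 + ["India"] * 2 + ["Brazil"] * 1,
    "Thermal sensation": [0, 1, -1, 0, 1, 0, -1, 0, 0, 1, -1, 2, 1, 0],
})

MAX_PER_COUNTRY = 2  # hypothetical cap, chosen only for this example

# Shuffle the rows reproducibly, then keep at most MAX_PER_COUNTRY rows per country.
shuffled = toy.sample(frac=1.0, random_state=0)
capped = shuffled.groupby("Country").head(MAX_PER_COUNTRY)

print(toy["Country"].value_counts())
print(capped["Country"].value_counts())
```

Countries with fewer records than the cap keep everything they have, so very small groups remain small; the cap only prevents the largest groups from dominating.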

# SET Data

The standardised SET data would be nice to use, but I note that the column contains many NaN values. Let's examine this data to see if it can be safely used.