# Abstract
**This notebook aims to answer "How do human and personal factors influence thermal comfort perception?", a question posed as a task under the ASHRAE Global Thermal Comfort Database II dataset. To answer this, three different random forests will first be developed to identify the most important features of those provided in the dataset. Once the most important features are identified, they will be used in kNN models to predict thermal preference.** One random forest using only person-specific features (e.g. age, sex) as input features will be used to identify the most important person-specific features for predicting thermal preference. A second random forest using non-person-specific features (e.g. season, city, heating strategy) will be used to identify the most important non-person-specific features for predicting thermal preference. A third random forest using both person-specific and non-person-specific features will be used to identify the most important features for predicting thermal preference. For this final random forest, we will develop two kNN models predicting thermal preference (multi-class classification) and compare their performance. One kNN model will use all features and the other kNN model will use only the most important features as identified by this final random forest. **By the end of this notebook, we'll know which features are most important for predicting thermal preference and we will have the accuracy of a model predicting thermal preference using these most important features.**

## Concepts used
- One-hot encoding categorical data
- Train/test split
- Random forest for feature selection 
- kNN for multi-class classification
- Standardizing feature values

Explanations of each of these concepts are not provided in this notebook.


# Data Wrangling, Encoding categorical features
In the dataset, a NaN in a particular column means that the study did not record that feature. Some NaNs are expected, since the dataset compiles metadata from field studies on thermal comfort and not all study designs are identical. Since scikit-learn's implementation of a random forest, as used in this notebook, cannot tolerate NaN values, columns from the main dataset that contain many NaNs will not be used. Furthermore, rather than trying to infer the missing values, any rows that contain a NaN will be omitted. Based on this, the columns we'll use are Year, Season, Koppen climate classification, Climate, City, Country, Building type, Cooling startegy_building level (the column name is spelled this way in the code below), Heating strategy_building level, Age, Sex, Thermal preference, Clo, Met, Air temperature (C), Outdoor monthly air temperature (C), Subject's height (cm), Subject's weight (kg). We will extract these columns and delete any rows that have NaN values.

In [None]:
import numpy as np 
import pandas as pd 

full_df= pd.read_csv('../input/ashrae-global-thermal-comfort-database-ii/ashrae_db2.01.csv', low_memory= False)

#create smaller dataframe with only the person-specific features, non-person-specific features, and what we're trying to predict
personal_feats= ['Age', 'Sex', 'Clo','Met',"Subject«s height (cm)", "Subject«s weight (kg)"]
other_feats= ['Year', 'Season', 'Koppen climate classification', 'Climate', 'City', 'Country', 'Building type', 'Cooling startegy_building level', 'Heating strategy_building level','Air temperature (C)', 'Outdoor monthly air temperature (C)']
label= ['Thermal preference']
df= full_df[personal_feats + other_feats + label]
df= df.dropna()
print('Dataset size: ', len(df))
df.head()
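
The column choice above depends on how many NaNs each column contains. As a minimal sketch of how that could be checked, the toy dataframe below stands in for the real data; its `Humidity` column and the `max_nan_fraction` cutoff of 0.5 are hypothetical and are not taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Sex": ["Female", "Male", np.nan, "Male"],
    "Clo": [0.5, 0.7, 0.6, 0.9],
    "Humidity": [np.nan, np.nan, np.nan, 45.0],  # hypothetical, mostly missing column
})

# Fraction of missing values per column, largest first
nan_fraction = toy.isna().mean().sort_values(ascending=False)
print(nan_fraction)

# Keep columns whose NaN fraction is at or below a hypothetical cutoff,
# then drop any remaining rows that still contain a NaN
max_nan_fraction = 0.5
kept_cols = [c for c in toy.columns if nan_fraction[c] <= max_nan_fraction]
toy_clean = toy[kept_cols].dropna()
print(kept_cols, len(toy_clean))
```

Running the same check on `full_df` would show which columns cost the most rows when `dropna()` is applied.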

We can also see that some of our features (columns) are numerical while others are categorical. For each of the categorical features, we'll see how many different levels there are and decide how we should encode these features.

In [None]:
from sklearn.model_selection import train_test_split

#define features and labels
feats= df.loc[:, df.columns != label[0]]
labs= df.loc[:, df.columns == label[0]]

#create 90/10 train/test split
x_train, x_test, y_train, y_test= train_test_split(feats, labs, test_size= 0.1, random_state= 0)

#see how many different levels there are for each categorical feature
categ_feats= feats.select_dtypes(include= ['category', object]).columns
for i in categ_feats:
    print(i, df[i].nunique(), '\n')  

Since there are not many levels for each of our categorical features, we will one-hot encode each categorical feature and replace the original columns with their one-hot encodings. *Note: how the one-hot encoding has been implemented in this notebook requires that any additional/future test sets must not have different levels from the levels in the training set for each of the encoded categorical features.* Then, although our dataset is small (1043 examples), we will use a 90/10 train/test split. The same split will be used for all random forests and kNN models.

In [None]:
from sklearn.preprocessing import OneHotEncoder

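#the encoder returns a sparse matrix by default, so we convert it to a dense array with .toarray()
#get_feature_names_out requires scikit-learn >= 1.0
encoder= OneHotEncoder(handle_unknown= 'ignore')

OH_feats= pd.DataFrame(encoder.fit_transform(feats[categ_feats]).toarray()) #encode the categorical features
OH_feats.columns = encoder.get_feature_names_out(categ_feats) #ensure encoded col names are meaningful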
OH_feats.index= feats.index #indices need to match in order to add one-hot encodings to original dataframe
feats= feats.drop(categ_feats, axis= 1) #delete categorical cols from original dataframe
feats= pd.concat([feats, OH_feats], axis= 1) 

#create 90/10 train/test split
x_train, x_test, y_train, y_test= train_test_split(feats, labs, test_size= 0.1, random_state= 0)
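
The note above about future test sets can be illustrated with a small sketch. It assumes an encoder fitted on training rows only, and the toy `Season` values (including the unseen level `Autumn`) are made up. With `handle_unknown='ignore'`, a level that was not present when the encoder was fitted is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Autumn" appears only in the test rows (hypothetical values)
toy_train = pd.DataFrame({"Season": ["Summer", "Winter", "Summer"]})
toy_test = pd.DataFrame({"Season": ["Winter", "Autumn"]})

# Fit on the training rows only, then reuse the fitted encoder on the test rows
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_train_oh = toy_encoder.fit_transform(toy_train).toarray()
toy_test_oh = toy_encoder.transform(toy_test).toarray()

print(toy_encoder.categories_[0])  # levels learned from the training rows
print(toy_test_oh)                 # the unseen "Autumn" row is all zeros
```

An all-zero row keeps the column layout identical between training and test sets, but the model receives no information about the unseen level.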

# Random Forest - Personal Features
Now we will train a random forest predicting thermal preference using only the personal features as input features and identify which of these features are most important for predicting thermal preference. For simplicity, the default settings will be used for all of the random forest's hyperparameters except the number of trees (estimators) and the random state. Once we've fit our forest, we'll determine the importance weight of each feature and identify the features with an importance weight at or above the median importance weight. **These features will be our most important person-specific features for predicting thermal preference.**

Since we will be completing this process two additional times, we'll write a function that accomplishes all of this.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def important_feats(x_train, y_train, x_test):
    """
    Function that fits a random forest and extracts the importance weight from the forest for each feature to determine which features are most important
    (Features with an importance weight at or above the median weight are most important)
    
    INPUTS: x_train is a pandas dataframe where each row is one example and each column is a feature (training data)
    y_train is a pandas dataframe with the corresponding labels to each example in x_train
    x_test is a pandas dataframe where each row is one example and each column is a feature (test data)
    
    OUTPUTS: important_feats is a numpy array with the names of the most important features
    x_train_new is the same as x_train except with only the most important features retained
    x_test_new is the same as x_test except with only the most important features retained

    """
    #define and fit forest
    forest= RandomForestClassifier(n_estimators= 1000, random_state= 0)
    forest.fit(x_train, np.ravel(y_train))

    #select most important features
    selector= SelectFromModel(forest, threshold= 'median')
    selector.fit(x_train, np.ravel(y_train))
    important_feats= np.array([]) #store the names of the most important features
    for i in selector.get_support(indices= True):
        important_feats= np.append(important_feats, x_train.columns[i])
    
    #return only the most important features (for both training and test sets)
    x_train_new= pd.DataFrame(selector.transform(x_train), columns= important_feats)
    x_test_new= pd.DataFrame(selector.transform(x_test), columns= important_feats)
    
    return important_feats, x_train_new, x_test_new


#redefine the columns that are person-specific features (names are different now because of the one-hot encoding!)
personal_feats= ['Sex_Female', 'Sex_Male', 'Age', 'Clo', 'Met', 'Subject«s height (cm)', 'Subject«s weight (kg)']

#for forest that uses only person-specific features:
x_train_personal= x_train.loc[:, personal_feats]
x_test_personal= x_test.loc[:, personal_feats]

#identify the most important person-specific features:
personal_important_feats, x_train_personal_new, x_test_personal_new= important_feats(x_train_personal, y_train, x_test_personal)
print(personal_important_feats)
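
To make the median rule concrete, here is a minimal sketch on synthetic data. The data from `make_classification` and the smaller forest of 50 trees are assumptions made for speed and are not the notebook's settings. It checks that `SelectFromModel(..., threshold='median')` keeps exactly the features whose importance is at or above the median importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic 3-class problem standing in for the thermal comfort features
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Fit the selector; it fits its own copy of the forest internally
toy_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
toy_selector.fit(toy_X, toy_y)

# Reproduce the selection by hand from the fitted forest's importances
importances = toy_selector.estimator_.feature_importances_
manual_mask = importances >= np.median(importances)

print(np.round(importances, 3))
print(manual_mask)
print(toy_selector.get_support())  # same boolean mask as manual_mask
```

Because the threshold is the median, roughly half of the features are always kept, regardless of how informative they are in absolute terms.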

**Therefore, when considering only person-specific features, the most important features for predicting thermal preference are Clo, Met, Subject height, and Subject weight. The values of these features are the most informative for predicting one's thermal preference.** For curiosity's sake, we'll train and evaluate a kNN model that uses only these features. First we'll standardize the features so that each feature has a mean of 0 and unit variance. Just as with the random forests, we'll use the default hyperparameter settings for this and all other kNN models. To evaluate the kNN model, we'll use the recall averaged over the classes (balanced accuracy).

# kNN - Personal features

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

def train_eval_knn(x_train, y_train, x_test, y_test):
    """
    Function that trains and tests a kNN multi-class classifier and returns the average recall on the test set
    
    INPUTS: x_train and x_test are 2D numpy arrays where each row is one example and each column is a feature,
    y_train and y_test are pandas dataframes with the corresponding label for the examples in x_train/x_test
    OUTPUT: test_recall is the average recall for the test set (single float value)
    """
    knn= KNeighborsClassifier()
    knn.fit(x_train, np.ravel(y_train))
    #note: default settings on kNN uses the Euclidean distance as the distance metric and equally weights all examples
    
    test_predicts= knn.predict(x_test)
    test_recall= balanced_accuracy_score(y_test, test_predicts)
    
    return test_recall

#scale the training data so that all features have a mean of 0 and unit variance
scaler= StandardScaler()
x_train_personal_new= scaler.fit_transform(x_train_personal_new)

#scale the test data using the mean and variance calculated from the training data
x_test_personal_new= scaler.transform(x_test_personal_new)

#train and test kNN that uses the most important person-specific features
personal_recall= train_eval_knn(x_train_personal_new, y_train, x_test_personal_new, y_test)
print(personal_recall)
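
The metric used above, `balanced_accuracy_score`, is the unweighted mean of the per-class recalls. A minimal sketch with made-up labels (the label strings and predictions are illustrative only) shows how it differs from plain accuracy when one class dominates.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up, imbalanced labels: 6 "no change", 2 "cooler", 2 "warmer"
toy_true = ["no change"] * 6 + ["cooler"] * 2 + ["warmer"] * 2
toy_pred = ["no change"] * 6 + ["no change", "cooler"] + ["no change", "no change"]

plain_acc = accuracy_score(toy_true, toy_pred)               # 7 of 10 correct
balanced = balanced_accuracy_score(toy_true, toy_pred)       # mean of per-class recalls
macro_recall = recall_score(toy_true, toy_pred, average="macro")

# Per-class recalls: "no change" 6/6, "cooler" 1/2, "warmer" 0/2
print(plain_acc, balanced, macro_recall)  # 0.7 0.5 0.5
```

A classifier that mostly predicts the majority class can score well on plain accuracy while its average recall stays low, which is why average recall is the more demanding metric here.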

A recall of 0.37 is not great! Can we achieve better prediction performance if we consider non-person-specific features? Let's see: we'll apply the same feature selection approach, then train and test a kNN model using only non-person-specific features.

# Random forest & kNN - Other features

In [None]:
#redefine the columns that are non-person-specific features (names are different now because of the one-hot encoding!)
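#use a list (not a set) so that the column order is deterministic and pandas accepts it as an indexer
other_feats= [col for col in x_train.columns if col not in personal_feats]

#get training and test sets that have only the non-person-specific features
x_train_other= x_train.loc[:, other_feats]
x_test_other= x_test.loc[:, other_feats]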

#identify the most important non-person-specific features:
other_important_feats, x_train_other_new, x_test_other_new= important_feats(x_train_other, y_train, x_test_other)
print(other_important_feats)

#scale feature values before running kNN...
x_train_other_new= scaler.fit_transform(x_train_other_new)
x_test_other_new= scaler.transform(x_test_other_new)

#train and test kNN that uses the most important non-person-specific features
other_recall= train_eval_knn(x_train_other_new, y_train, x_test_other_new, y_test)
print(other_recall)

**Therefore, when considering only non-person-specific features, the most important features for predicting thermal preference are Outdoor monthly air temperature, Climate, City, Koppen climate classification, Season, Air temperature, and Year. The average recall using these most important non-person-specific features in a kNN model is ~0.59.** Interesting! Our predictions are better when we use non-person-specific features as input to our model compared to when we use person-specific features. Let's see what the most important features are and what our predictive performance is like when we consider both the person-specific and non-person-specific features.

# Random forest & kNN - All features

In [None]:
#identify the most important features
#(stored under a new name so that the important_feats function is not overwritten)
all_important_feats, x_train_new, x_test_new= important_feats(x_train, y_train, x_test)
print(all_important_feats)

#scale feature values before running kNN...
x_train_new= scaler.fit_transform(x_train_new)
x_test_new= scaler.transform(x_test_new)
x_train= scaler.fit_transform(x_train)
x_test= scaler.transform(x_test) #scale the test data using the mean and variance calculated from the training data

#train and test kNN that uses the most important features
recall= train_eval_knn(x_train_new, y_train, x_test_new, y_test)
print(recall)

#train and test kNN that uses all of the features, not just the most important ones
recall_allfeats= train_eval_knn(x_train, y_train, x_test, y_test)
print(recall_allfeats)
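
Scaling statistics should come from the training data only, and one way to make that hard to get wrong is to wrap the scaler and the kNN model in a single pipeline. The sketch below is an alternative arrangement, not the approach used in this notebook, and the synthetic data and `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the encoded feature matrix
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=0, stratify=toy_y
)

# fit() learns the scaling statistics from the training rows only;
# predict() reuses those statistics on the test rows
toy_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
toy_model.fit(toy_X_train, toy_y_train)
toy_score = balanced_accuracy_score(toy_y_test, toy_model.predict(toy_X_test))

# The scaler's stored mean matches the training-set mean, not the test-set mean
fitted_mean = toy_model.named_steps["standardscaler"].mean_
print(np.allclose(fitted_mean, toy_X_train.mean(axis=0)), toy_score)
```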

**Therefore, when considering both person-specific and non-person-specific features, the most important features for predicting thermal preference are Age, Clo, Met, Subject height, Subject weight, Sex, Outdoor monthly air temperature, Climate, City, Season, and Koppen climate classification. The average recall using these most important features in a kNN model is ~0.58. Whether we use only these most important features or all of the features, we get almost the same average recall.** Since the average recall is almost the same either way, the excluded features appear to add little to the model's predictive performance and may not need to be included. Future thermal comfort studies that will be added to the ASHRAE database could be designed so that these excluded features are not measured.

# Conclusions
The purpose of this notebook was to answer "How do human and personal factors influence thermal comfort perception?" using metadata from thermal comfort studies. When considering only person-specific features, the most important features for predicting thermal preference are Clo, Met, Subject height, and Subject weight. **This suggests that these personal factors are the ones that most strongly influence thermal comfort perception.** However, these person-specific features are not the best ones to use to predict thermal preference (average recall of ~0.37 in a kNN model). Instead, the non-person-specific features of Outdoor monthly air temperature, Climate, City, Koppen climate classification, Season, Air temperature, and Year yield the best average recall (~0.59) in a kNN model used to predict thermal preference. A kNN model that used these non-person-specific features performed slightly better than a kNN model that used both person-specific and non-person-specific features (~0.58), suggesting that environmental factors rather than personal factors most strongly influence thermal comfort perception.

## Suggestions for future work
After data wrangling, the final dataset used for the analyses was very small (1043 examples compared to the original >100,000). Future work should consider using different features and/or a different technique for handling NaN values to increase the dataset size. Along the same lines, since the dataset was so small, cross-validation would likely have given a more representative estimate of performance for the random forest and kNN models than a single train/test split. A final suggestion for those wanting to pursue this analysis further is to perform a grid search to determine the best hyperparameter settings for the kNN models rather than using the default settings. With optimal hyperparameter settings, the average recall for the various kNN models tested here might be higher. Overall, I hope this notebook acts as a good springboard for those wanting to further explore "How do human and personal factors influence thermal comfort perception?"
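
As a starting point for the cross-validation and grid search suggested above, here is a minimal sketch. It was not run on the thermal comfort data, so the synthetic data, the grid values, and the 5-fold setup are all assumptions. It tunes a scaled kNN model using average recall (`balanced_accuracy`) as the score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the encoded feature matrix
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scaling sits inside the pipeline so it is re-fit on each training fold
toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid: number of neighbours and neighbour weighting
toy_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 11, 21],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

toy_search = GridSearchCV(
    toy_pipe,
    toy_grid,
    scoring="balanced_accuracy",  # mean per-class recall, as used in this notebook
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

Replacing `toy_X` and `toy_y` with the notebook's encoded features and labels would give a cross-validated estimate to compare against the single-split recalls reported above.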
