# **Introduction**

The ASHRAE Global Thermal Comfort dataset is another fascinating dataset that combines objective and subjective data. This combination offers an interesting perspective on how real people are affected by HVAC systems, and it lets us dive into the world of HVAC science and hopefully emerge with new insights.

My initial idea for this dataset is to conduct a geographical analysis of HVAC systems and thermal preferences. We will do a lot of geographical plotting combined with some spider graphs and stacked bar graphs. The end goals are:

- Explore how geography and climate affect HVAC systems and thermal preferences.
- Find possible correlations between thermal preferences and subjective data.
- Determine what factors influence thermal dissatisfaction.


# **Importing libraries and data**


In addition to the ASHRAE dataset, I am going to import the city dataset for its location data. Please note that there will be some changes to the location data that we need to apply manually.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objects as go

import warnings
warnings.filterwarnings("ignore")

import datetime

#Define plot style
plt.style.use("fivethirtyeight")

import gc

In [None]:
df = pd.read_csv(r"../input/ashrae-global-thermal-comfort-database-ii/ashrae_db2.01.csv")
df_city = pd.read_csv(r"../input/cities-of-the-world/cities15000.csv",encoding = "ISO-8859-1") # Location data

# **Initial data analysis**

We are going to take a sneak peek at the data and determine how much of it we are going to use.

In [None]:
df.head()

The dataset has a lot of columns, and many of them are sparsely populated, with less than 40% of their values filled in. Additionally, there are duplicate columns (for example, temperature data in both Fahrenheit and Celsius). We are going to review them and determine which ones to drop.

In [None]:
df.info()
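To put a number on how sparse each column is, one option is to compute the share of non-null values per column. The sketch below uses a small made-up frame (`toy`) and a hypothetical helper `non_null_share`; neither comes from the notebook, and the 40% threshold simply mirrors the figure mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real dataset (assumption for illustration)
toy = pd.DataFrame({
    "Air temperature (C)": [22.1, 23.4, np.nan, 21.0, 24.2],
    "Age": [np.nan, np.nan, 31.0, np.nan, np.nan],
    "City": ["Sydney", "Sydney", "Berkeley", "Berkeley", "Athens"],
})

def non_null_share(frame):
    """Return the percentage of non-null values per column, sparsest first."""
    return frame.notna().mean().mul(100).round(1).sort_values()

share = non_null_share(toy)

# Columns with less than 40% of their values present
sparse_columns = share[share < 40].index.tolist()
print(share)
print(sparse_columns)
```

On the real data the same call would be `non_null_share(df)`, and the resulting list could be reviewed by hand before anything is dropped.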

In [None]:
to_drop = ["Air temperature (F)",#Duplicate data, we have it in celsius.
           "Ta_h (F)","Ta_m (F)",#Duplicate data.
           "Ta_l (F)","Operative temperature (F)", #Duplicate data.
           "Radiant temperature (F)", #Duplicate data.
           "Globe temperature (F)", #Duplicate data
           "Tg_h (F)", #Duplicate data
           "Tg_m (F)", #Duplicate data
           "Tg_l (F)", #Duplicate data
           "Publication (Citation)", #Unnecessary for the analysis
           "Data contributor",#Unnecessary for the analysis
           "Database", #Unnecessary for the analysis
           "Air velocity (fpm)", #Duplicate data
           "Velocity_h (fpm)", #Duplicate data
           "Velocity_m (fpm)", #Duplicate data
           "Velocity_l (fpm)", #Duplicate data
           "Outdoor monthly air temperature (F)",#Duplicate data
           "Blind (curtain)", #Unnecessary for the analysis
           'Fan', 'Window', #Unnecessary for the analysis
           'Door','Heater', #Unnecessary for the analysis
           'activity_10', #Unnecessary for the analysis
           'activity_20', #Unnecessary for the analysis
           'activity_30', #Unnecessary for the analysis
           'activity_60' #Unnecessary for the analysis
          ]

df.drop(to_drop, axis=1, inplace=True)
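Before dropping a Fahrenheit column as a duplicate, it can be worth confirming that it really carries the same information as its Celsius counterpart. A minimal sketch, assuming a hypothetical helper `is_unit_duplicate` and made-up values in `toy`:

```python
import numpy as np
import pandas as pd

def is_unit_duplicate(frame, c_col, f_col, tol=0.1):
    """Check that f_col, converted to Celsius, matches c_col within tol degrees.

    Rows where both values are missing count as matching; a row where only
    one of the two is missing counts as a mismatch.
    """
    converted = ((frame[f_col] - 32) * 5 / 9).to_numpy()
    return bool(np.allclose(frame[c_col].to_numpy(), converted,
                            atol=tol, equal_nan=True))

# Made-up values (assumption for illustration)
toy = pd.DataFrame({
    "Air temperature (C)": [20.0, 25.0, np.nan],
    "Air temperature (F)": [68.0, 77.0, np.nan],
})

duplicate = is_unit_duplicate(toy, "Air temperature (C)", "Air temperature (F)")
print(duplicate)
```

If the check fails on the real data, the Fahrenheit column may hold values that are missing from the Celsius one and could be used to fill gaps instead of being dropped.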

# **Exploratory data analysis**
 

As previously mentioned, the main focus of this notebook will be on the geographical analysis of the thermal dataset. To that end, we will first plot all buildings on a world map. It is important to note that this dataset covers the thermal preference of subjects in these buildings over almost a century. Put simply, each building contributes many survey responses, so the number of entries does not equal the number of unique buildings.

We will need to merge the location dataset with the thermal dataset. Before we do that, we need to address a mismatch between the two. In particular, a place called "Midland" exists in both the US and the UK. Please note that Midland is not an actual city in this context, but an area (the Midlands in the UK), so its coordinates are set manually.

**Location data preparation**

In [None]:
geo_df = df.groupby("City")["City"].agg("size")
geo_df = geo_df.reset_index(name="Count")

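#Copy the slice so the in-place changes below act on an independent frame
df_city = df_city[["asciiname", "latitude", "longitude"]].copy()
df_city.rename(columns = {"asciiname" : "City","latitude" : "Lat", "longitude" : "Lng"},inplace=True)
#Keep one row per city name (the first match wins)
df_city.drop_duplicates(subset="City",inplace=True)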

#Midland is in the UK
df_city.loc[(df_city.City == "Midland"),"Lat"]= 52.489471
df_city.loc[(df_city.City == "Midland"),"Lng"]= -1.898575

geo_df = pd.merge(geo_df,df_city,how="left", on="City")

geo_df.sort_values(by="Count",ascending=False, inplace=True)
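A left merge silently leaves `Lat` and `Lng` empty for any city name that has no match in the location data, and those cities then disappear from the map. One way to spot them is the `indicator` option of `merge`. The frames below are made-up stand-ins ("Atlantis" is a deliberately unmatched placeholder), not the notebook's data.

```python
import pandas as pd

# Made-up stand-ins for geo_df and df_city (assumption for illustration)
counts = pd.DataFrame({"City": ["Sydney", "Midland", "Atlantis"],
                       "Count": [10, 5, 2]})
coords = pd.DataFrame({"City": ["Sydney", "Midland"],
                       "Lat": [-33.87, 52.489471],
                       "Lng": [151.21, -1.898575]})

# indicator=True adds a "_merge" column saying where each row was found
merged = counts.merge(coords, how="left", on="City", indicator=True)

# Rows found only in the left frame have no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "City"].tolist()
print(unmatched)
```

Any names that show up in `unmatched` are candidates for the kind of manual correction applied to Midland.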

In [None]:
fig = go.Figure(go.Scattergeo(lon=geo_df["Lng"],
                              lat=geo_df["Lat"],
                              text=geo_df["City"] + "<br>Count: " + geo_df["Count"].astype(str),
                              marker = dict(
                                  size = geo_df["Count"]/1000,
                                  line_width = 0,sizemin=5)
                             )
               )


fig.update_layout(title_text = "Geographical distribution of the buildings")

fig.update_geos(projection_type="natural earth")

fig.show()

Most data stems from Europe, the USA and India. We will do a quick bar plot analysis to determine the exact numbers for the cities/countries.

In [None]:
#Data
geo_df.sort_values(by="Count",ascending=True, inplace=True)

#Plot
fig, ax = plt.subplots(figsize=(10,5))

plt.barh(geo_df["City"][-10:],geo_df["Count"][-10:])

plt.ylabel("Cities", fontsize=18, alpha=.75)
plt.xlabel("Number", fontsize=18, alpha=.75)

plt.yticks(alpha=0.75,weight="bold")
plt.xticks(alpha=0.75)

plt.title("Most entries per city",alpha=0.75,weight="bold",fontsize=20, loc="left")

In [None]:
#Plot
fig, ax = plt.subplots(figsize=(10,5))

plt.barh(df["Country"].value_counts().index,df["Country"].value_counts())

plt.ylabel("Countries", fontsize=18, alpha=.75)
plt.xlabel("Number", fontsize=18, alpha=.75)

plt.yticks(alpha=0.75, fontsize=10)
plt.xticks(alpha=0.75)

plt.title("Most entries per country",alpha=0.75,weight="bold",fontsize=20, loc="left")

del geo_df

Now we will shift our focus to the heating and cooling strategies each of these entries uses. We will try to determine and plot the most frequently occurring strategies in this dataset.

Unfortunately, the first issue appears here. The heating strategies column has a large number of missing values, and the remaining data falls into a single category, which makes it unsuitable for plotting and analysis. That is why our focus shifts completely to the cooling strategies of the dataset. This focus will allow us to present the data in a couple of ways, and it should be fun.

We will present the different strategies on an interactive geo plot using plotly. Plotly geoplot can sometimes get a bit crowded so please click on each value of the legend to filter it out.

In [None]:
cooling_geo = df.groupby(["City","Cooling startegy_building level"])["City"].agg("size")
cooling_geo = cooling_geo.reset_index(name="Count")
cooling_geo = pd.merge(cooling_geo,df_city,how="left", on="City")

gc.collect()

In [None]:
fig = go.Figure()

for i in cooling_geo["Cooling startegy_building level"].unique():

    df_part = cooling_geo[cooling_geo["Cooling startegy_building level"] == i]
    fig.add_trace(go.Scattergeo(
        lon = df_part["Lng"],
        lat = df_part["Lat"],
        text= df_part["City"] + "<br>Count: " + df_part["Count"].astype(str),
        name = i,
        marker = dict(
            size = df_part["Count"]/25,
            line_color='rgb(40,40,40)',
            line_width=1.5,
            sizemode = 'area'
        )
    ))

fig.update_layout(dict(
        title = "Geographical cooling strategies (click on the legend to filter data)",
        height=450,
        geo = dict(
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        )))
fig.update_geos(projection_type="natural earth")

fig.show()
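The markers above are scaled by dividing the counts by a fixed constant. An alternative described in the plotly bubble chart documentation is to pass the raw values and let `sizeref` do the scaling, using `sizeref = 2 * max(size) / (desired_max_px ** 2)` when `sizemode` is `'area'`. The helper name `area_sizeref` and the sample counts below are assumptions for illustration.

```python
import numpy as np

def area_sizeref(sizes, max_marker_px=40):
    """Return a sizeref so the largest marker is about max_marker_px wide.

    Intended for marker dicts that use sizemode='area'.
    """
    sizes = np.asarray(sizes, dtype=float)
    return 2.0 * np.nanmax(sizes) / (max_marker_px ** 2)

# Made-up counts (assumption for illustration)
counts = [100, 400, 900]
sizeref = area_sizeref(counts, max_marker_px=30)
print(sizeref)

# The value would then be passed alongside the raw counts, for example:
# marker = dict(size=counts, sizemode="area", sizeref=sizeref, sizemin=4)
```

This keeps the largest bubble at a predictable size regardless of how many entries the busiest city has.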

To summarize the different strategies per climate, we will convert each strategy's count to a percentage and present the result as a stacked bar plot.

In [None]:
#Count the entries for each climate/cooling strategy combination
table = pd.crosstab(df["Climate"], df["Cooling startegy_building level"])

In [None]:
def conv_to_per(df):

    """
    Converts each row of counts to percentages of the row total.
    Returns a new dataframe with " percent" appended to the column names.
    """
    #Use the dataframe passed in, not the global table
    row_totals = df.sum(axis=1)
    percent = df.div(row_totals, axis=0).mul(100).round(2)

    return percent.add_suffix(" percent")

In [None]:
#Data
table = conv_to_per(table)

#Plot
table.plot.barh(stacked=True,figsize=(15,10))

plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, ncol=5)

plt.title("Percentage of cooling strategies per climate",
          alpha=0.75,weight="bold",fontsize=20, loc="left")
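pandas can also produce row percentages in a single step through the `normalize="index"` option of `pd.crosstab`. The sketch below runs on a small made-up frame (`toy`) rather than the real data, so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up entries (assumption for illustration)
toy = pd.DataFrame({
    "Climate": ["Humid subtropical"] * 4 + ["Hot semi-arid"] * 2,
    "Cooling startegy_building level": [
        "Air Conditioned", "Air Conditioned", "Air Conditioned",
        "Naturally Ventilated", "Naturally Ventilated", "Mixed Mode",
    ],
})

# normalize="index" divides every row by its own total
percent = (pd.crosstab(toy["Climate"],
                       toy["Cooling startegy_building level"],
                       normalize="index")
             .mul(100)
             .round(2))
print(percent)
```

Each row of `percent` adds up to 100, so it can be passed straight to `plot.barh(stacked=True)`.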

Another interesting way to present this data is via spider/radar plots. While researching this plot, I found that many of the existing approaches are a bit complicated and difficult to understand. The code I wrote for it is more straightforward and requires very little alteration to reuse.

In [None]:
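fig = plt.figure(figsize=(15,40))

lab = table.columns

#Two plots per row, as many rows as needed
n_rows = int(np.ceil(len(table) / 2))

for i in range(len(table)):
    
    data = table.iloc[i,:].to_numpy()
    #Repeat the first value so the line closes into a polygon
    data_adjusted = np.concatenate((data,[data[0]]))
    label_place = np.linspace(start=0, stop=2*np.pi, num=len(data_adjusted))

    plt.subplot(n_rows,2,1+i,polar=True)
    plt.title(table.index[i], fontsize=18)
    plt.plot(label_place, data_adjusted)
    #The last angle repeats the first one, so it gets no label of its own
    lines, labels = plt.thetagrids(np.degrees(label_place[:-1]), labels=lab)

plt.tight_layout()
plt.subplots_adjust(top=1.5)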
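The key step in a radar plot is closing the polygon: the first value is repeated at the end so the line returns to its starting point. A minimal sketch of just that step, using a hypothetical helper `radar_coordinates` and made-up percentages:

```python
import numpy as np

def radar_coordinates(values):
    """Return (angles, values) with the first point repeated at the end."""
    values = np.asarray(values, dtype=float)
    # One evenly spaced angle per category, in radians
    angles = np.linspace(0, 2 * np.pi, num=len(values), endpoint=False)
    closed_angles = np.append(angles, angles[0])
    closed_values = np.append(values, values[0])
    return closed_angles, closed_values

# Made-up percentages for four categories (assumption for illustration)
angles, closed = radar_coordinates([10, 20, 30, 40])
print(np.degrees(angles))
print(closed)
```

The returned arrays can be passed directly to `plt.plot` on a polar axis, while only the first `len(values)` angles need tick labels.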
    

# **Thermal preference/comfort**

This is the central piece of the whole dataset. Thermal comfort in this context expresses satisfaction with the thermal environment and is assessed by subjective evaluation, as stated by ANSI/ASHRAE Standard 55. One of the major goals of HVAC systems is maintaining thermal comfort for the building's occupants. To achieve this, occupants must be in thermal equilibrium with their surroundings so that the heat generated by human metabolism can dissipate, meaning the environment is neither too hot nor too cold. Many factors affect this process, such as metabolic rate, clothing insulation, air temperature, mean radiant temperature, air speed and relative humidity. There are also some psychological factors involved, but we will not delve that deep into that part of the subject. All of this means that the thermal comfort zone is not a fixed point, but rather a flexible range that is unique to the individual.


First and foremost, we are going to look at the subjective data to see whether individuals are content with the temperature in the building.

In [None]:
fig, ax = plt.subplots(figsize=(10,7))

plt.bar(df["Thermal preference"].value_counts().index,df["Thermal preference"].value_counts())

plt.ylabel("Number", fontsize=18, alpha=.75)
plt.xlabel("Subject temperature satisfaction", fontsize=18, alpha=.75)

plt.yticks(alpha=0.75, fontsize=10)
plt.xticks(alpha=0.75)

plt.title("Subjective temperature review",alpha=0.75,weight="bold",fontsize=20, loc="left")

gc.collect()

A lot of people are content with the temperature in their respective buildings. However, quite a lot of occupants are still unsatisfied with the current situation.

In [None]:
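fig, ax = plt.subplots(figsize=(10,5))

#distplot is deprecated in recent seaborn versions; histplot with a KDE is the closest replacement
sns.histplot(df["Thermal sensation"].dropna(), kde=True, stat="density")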

plt.ylabel("",alpha=.75)
plt.xlabel("Subject thermal sensation", fontsize=18, alpha=.75)

plt.yticks(alpha=0.75, fontsize=10)
plt.xticks(alpha=0.75)

plt.title("Distribution of thermal sensation",alpha=0.75,weight="bold",fontsize=20, loc="left")

gc.collect()

The thermal sensation scale goes from -3 to +3 and represents how cold/hot a subject feels. The extremes of this scale indicate major discomfort with the temperature at the individual sites. Plotting this information results in a roughly Gaussian-shaped distribution, with most occupants being content.
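For readability, the numeric votes can be translated into the labels of the ASHRAE seven-point scale (cold, cool, slightly cool, neutral, slightly warm, warm, hot). The helper `sensation_label` below is a hypothetical sketch; rounding fractional votes to the nearest whole step is an assumption made for illustration, not something the dataset prescribes.

```python
import math

# ASHRAE seven-point thermal sensation scale
ASHRAE_SCALE = {
    -3: "cold", -2: "cool", -1: "slightly cool", 0: "neutral",
    1: "slightly warm", 2: "warm", 3: "hot",
}

def sensation_label(vote):
    """Map a numeric vote to its nearest scale label (None if missing)."""
    if vote is None or math.isnan(vote):
        return None
    # Round to the nearest step and keep the result inside -3..3
    nearest = max(-3, min(3, round(vote)))
    return ASHRAE_SCALE[nearest]

labels = [sensation_label(v) for v in (-3, -0.2, 1.4, 2.6, float("nan"))]
print(labels)
```

On the real data this could be applied with `df["Thermal sensation"].map(sensation_label)`.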

Now, let's do a quick linear correlation check using a heatmap. The columns I chose for this particular heatmap are all related to the individual subject.

In [None]:
fig, ax = plt.subplots(figsize=(12,7))

#Data
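df_numeric = df[["Age","Sex","Clo","Met","Subject«s height (cm)","Subject«s weight (kg)","Thermal sensation"]]
#Correlation is only defined for numeric columns, so non-numeric ones are left out
df_numeric = df_numeric.select_dtypes("number").corr()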

#Heatmap
ax = sns.heatmap(df_numeric, annot=True,annot_kws={"size": 14},linewidths=.5,center=0,cbar=False)

#Heatmap bug fix
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

del df_numeric

gc.collect()

There is almost no linear correlation between the variables. However, this does not mean that the columns are unrelated. A non-linear relationship may still exist, and the analysis mixes several preference categories, which might mask relationships within each group.
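One simple way to look beyond linear correlation is Spearman's rank correlation, which equals the Pearson correlation computed on ranks and therefore picks up any monotonic relationship (it still misses non-monotonic ones). The data below is synthetic and only meant to show the difference between the two measures.

```python
import numpy as np
import pandas as pd

# Synthetic monotonic but strongly non-linear relationship (assumption)
x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)

# Pearson measures linear association
pearson = x.corr(y)

# Spearman is Pearson applied to the ranks of the values
spearman = x.rank().corr(y.rank())

print(round(pearson, 2), round(spearman, 2))
```

For a whole dataframe, `DataFrame.corr(method="spearman")` gives the same measure for every pair of numeric columns.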

Before we go into the deeper analysis, we are going to plot the geographical locations of all thermal preferences and look for insights.

In [None]:
termal_pref = df.groupby(["City","Thermal preference"])["City"].agg("size")
termal_pref = termal_pref.reset_index(name="Count")


termal_pref = pd.merge(termal_pref,df_city,how="left", on="City")

In [None]:
fig = go.Figure()

for i in termal_pref["Thermal preference"].unique():

    df_part = termal_pref[termal_pref["Thermal preference"] == i]
    fig.add_trace(go.Scattergeo(
        lon = df_part["Lng"],
        lat = df_part["Lat"],
        text= df_part["City"] + "<br>Count: " + df_part["Count"].astype(str),
        name = i,
        marker = dict(
            size = df_part["Count"]/25,
            line_color='rgb(40,40,40)',
            line_width=1.5,
            sizemode = 'area'
        )
    ))

fig.update_layout(dict(
        title = "Geographical thermal preference (click on the legend to filter data)",
        height=450,
        geo = dict(
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        )))
fig.update_geos(projection_type="natural earth")

fig.show()

# **Categorical warmer/cooler split**

As mentioned above, it is difficult to analyse multi-category data, as various factors influence each category differently. To combat this, we will split the data based on the occupant's response: warmer, cooler or no change. For now, we will leave out the no change group, as they are content with the current conditions.

In [None]:
#Data
warmer_df = df[df["Thermal preference"] == "warmer"]

#Plot
fig, ax = plt.subplots(figsize=(10,5))

plt.barh(warmer_df["Climate"].value_counts().index,warmer_df["Climate"].value_counts())

plt.ylabel("Climate",alpha=.75)
plt.xlabel("Count", fontsize=18, alpha=.75)

plt.yticks(alpha=0.75, fontsize=10)
plt.xticks(alpha=0.75)

plt.title("Count of warmer requests per climate", alpha=.75, fontsize=20, weight="bold", loc="left")

gc.collect()
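Raw counts naturally favour climates with many entries. A possible complement is the share of "warmer" responses within each climate. The sketch below runs on made-up responses in `toy`, so the percentages say nothing about the real dataset.

```python
import pandas as pd

# Made-up responses (assumption for illustration)
toy = pd.DataFrame({
    "Climate": ["Humid subtropical"] * 4 + ["Hot semi-arid"] * 2,
    "Thermal preference": ["warmer", "no change", "no change", "cooler",
                           "warmer", "warmer"],
})

# Mean of a boolean column is the fraction of True values per group
warmer_share = (toy["Thermal preference"].eq("warmer")
                  .groupby(toy["Climate"])
                  .mean()
                  .mul(100)
                  .round(1))
print(warmer_share)
```

Plotting `warmer_share` next to the counts would show whether a climate stands out because of its preferences or simply because of its sample size.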

In [None]:
warmer_season = warmer_df[warmer_df["Climate"].isin(warmer_df["Climate"].value_counts().index[:3])]

warmer_season = warmer_season[["Season",
                               "Climate",
                               "Operative temperature (C)",
                               "Outdoor monthly air temperature (C)"]] 

In [None]:
fig = plt.figure(figsize=(15,5))

for i, climate in enumerate(warmer_season["Climate"].unique()):
    
    #Use a separate name so the main df is not overwritten
    climate_df = warmer_season.query("Climate == @climate")
    
    #25% as sample since my CPU is going to burn up
    sample = climate_df.sample(frac=0.25)
    
    try:
        plt.subplot(1,3,1+i)
        sns.swarmplot(x="Season",
                      y="Operative temperature (C)",
                      color="#008FD5",
                      alpha=0.75,
                      data=sample,
                      label="Op"
                     )
        sns.swarmplot(x="Season",
                      y="Outdoor monthly air temperature (C)",
                      color="#FF2700",
                      alpha=0.75,
                      data=sample,
                      label="Outside"
                     )
        
        plt.title(climate, fontsize=18, alpha=0.75)
        plt.ylabel("Temperature", fontsize=18)
        
    except Exception as err:
        #Report the failure instead of silently skipping the panel
        print(f"Skipped {climate}: {err}")
    
plt.text(x=-11,y=30, s="Operative vs Outside temperature", fontsize=25, weight="bold", alpha=0.75)
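Because `sample` draws different rows on every run, the swarm plots change slightly each time the cell is executed. Passing `random_state` makes the draw repeatable. The frame below is made up, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up measurements (assumption for illustration)
toy = pd.DataFrame({
    "Season": ["Winter", "Summer"] * 4,
    "Operative temperature (C)": np.linspace(20, 27, 8),
})

# The same seed always selects the same 25% of the rows
first = toy.sample(frac=0.25, random_state=42)
second = toy.sample(frac=0.25, random_state=42)

print(len(first), first.equals(second))
```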

In [None]:
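fig, ax = plt.subplots(figsize=(10,5))

#distplot is deprecated in recent seaborn versions; histplot with a KDE is the closest replacement
sns.histplot(warmer_df["Operative temperature (C)"].dropna(), kde=True, stat="density", label="Operative temperature")
sns.histplot(warmer_df["Outdoor monthly air temperature (C)"].dropna(), kde=True, stat="density", color="C1", label="Outdoor temperature")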

plt.xlabel("Temperature", fontsize=15, alpha=0.75, weight="bold")

plt.title("Temperature distribution: Operative vs Outdoor", fontsize=20, alpha=0.75, weight="bold", loc="left")
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
          fancybox=True, ncol=5)

# **Not finished from this point onwards :D**

In [None]:
#Lets see how geographic factors affect the pref

cooler_df = df[df["Thermal preference"] == "cooler"]

plt.barh(cooler_df["Climate"].value_counts().index,cooler_df["Climate"].value_counts())

In [None]:
plt.barh(cooler_df["Cooling startegy_building level"].value_counts().index,cooler_df["Cooling startegy_building level"].value_counts())

In [None]:
sns.histplot(cooler_df["Operative temperature (C)"].dropna(), kde=True, stat="density")

In [None]:
op_temp = df.groupby("City")["Operative temperature (C)"].agg("mean")
op_temp = op_temp.reset_index(name="Mean")

op_temp = pd.merge(op_temp,df_city,how="left", on="City")

In [None]:
fig = go.Figure()

df_part = op_temp.dropna()
df_part = df_part[df_part["Mean"] <= 22]
fig.add_trace(go.Scattergeo(
    lon = df_part["Lng"],
    lat = df_part["Lat"],
    text= df_part["City"] + "<br>Mean Operative Temperature: " + round(df_part["Mean"],2).astype(str),
    name = "<= 22",
    marker = dict(
        size = round(df_part["Mean"]/2,2),
        line_color='rgb(40,40,40)',
        line_width=1.5
    )
))

df_part = op_temp.dropna()
df_part = df_part[df_part["Mean"] > 22]
fig.add_trace(go.Scattergeo(
    lon = df_part["Lng"],
    lat = df_part["Lat"],
    text= df_part["City"] + "<br>Mean Operative Temperature: " + round(df_part["Mean"],2).astype(str),
    name = "> 22",
    marker = dict(
        size = round(df_part["Mean"]/2,2),
        line_color='rgb(40,40,40)',
        line_width=1.5
    )
))

fig.update_layout(dict(
        title = "Geographical operative temperature",
        height=450,
        geo = dict(
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
        )))
fig.update_geos(projection_type="natural earth")

fig.show()
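The two traces above repeat the same code for each side of the 22 °C threshold. One way to avoid the repetition is to label each city with `pd.cut` first and then loop over the labels. The city means below are made-up values for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up mean operative temperatures per city (assumption for illustration)
means = pd.Series([19.5, 22.0, 22.1, 27.3],
                  index=["City A", "City B", "City C", "City D"])

# Intervals are closed on the right, so exactly 22 falls in "<= 22"
band = pd.cut(means, bins=[-np.inf, 22, np.inf], labels=["<= 22", "> 22"])
print(band)

# Each label can then drive one trace, for example:
# for label, part in op_temp.groupby(band_column): fig.add_trace(...)
```

The same pattern extends to more than two temperature bands by adding further bin edges and labels.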