### Import Packages


In [1]:
# Import packages
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import re

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.svm import SVC


### Anonymise and Clean the Dataset

The original dataset was obtained from Vital Energi's VitalView system.  This is a live system used to assess the performance building Management Systems (BMS) and the heating, ventilation, and air conditiong systems they control.  It contains customer specific contract and building names.  
The following code takes in the original dataset and anonymises customer specific names and identification.  Only the redacted dataset has been provided with this code.
As the focus of the exercise is to predict datapoint category based on its name within a BMS I've dropped samples with generic sensor names and any NaN values within the 'Datapoint' and 'Category' columns only.

In [2]:
# Import original dataset, anonymise customer names with generic contract_n, drop NaN and generic values and save as anonymised

df = pd.read_csv('VitalView_dataset_cleaned.csv')

# Remove Generic Classes from Dataset and drop NaNs from Key Columns (Generic classes occur when correct type is not known)
df.replace({'Generic Analogue Sensor','Generic Digital Sensor','Default Auto'}, np.nan, inplace=True)
columns_to_dropna = ['Datapoint', 'Category']
df.dropna(subset=columns_to_dropna, inplace=True)

# Get the unique classes in the Contract/Site column
contract_column = 'Contract/Site'
contract_names = df[contract_column].unique()

# Create a dictionary to map classes to an anonimised name - contract_1...
class_to_anonymised = {class_name: f'contract_{i}' for i, class_name in enumerate(contract_names)}
df[contract_column] = df[contract_column].replace(class_to_anonymised)
contract_names = df[contract_column].unique()

# Save as New Anonymised .csv file
df.to_csv('VitalView_dataset_anonymised.csv', index=False)


### Load and Visualise the Redacted Dataset

In [3]:
# Import anonymised dataset & visualise
df = pd.read_csv('VitalView_dataset_anonymised.csv')
df.head()


Unnamed: 0,Contract/Site,Plantroom/Area,Datapoint,System,Category
0,contract_0,Barnes,GenAhu_ClnExtFan,Ventilation,Enable Status
1,contract_0,Barnes,GenAhu_Htg_Coil,Ventilation,Controlled Demand Speed
2,contract_0,Barnes,GenAhu_SupFan,Ventilation,Enable Status
3,contract_0,Barnes,HWS_Destrat_Pump1,DHW,Enable Status
4,contract_0,Barnes,HWS_Destrat_Pump2,DHW,Enable Status


### Drop Unimportant Columns
As the focus of the exercise is to predict datapoint category based on its name within a BMS I've dropped all columns excluding the 'Datapoint' which contains the shorthand name that we want to use as the predictor and 'Category' which is the class we want to predict.

In [4]:
# Drop unwanted columns
columns_to_drop = ['Contract/Site', 'Plantroom/Area', 'System']
df.drop(columns=columns_to_drop, inplace=True)
df.head()


Unnamed: 0,Datapoint,Category
0,GenAhu_ClnExtFan,Enable Status
1,GenAhu_Htg_Coil,Controlled Demand Speed
2,GenAhu_SupFan,Enable Status
3,HWS_Destrat_Pump1,Enable Status
4,HWS_Destrat_Pump2,Enable Status


### Visualise the Balance of Samples Within Each Category
It is important to understand the balance between number of samples in each category within the dataset.  If one or another category has significantly more or less samples it can affect the accuracy of model predictions.

In [5]:
# Check balance of categories and oversample if required
pd.set_option('display.max.rows', None) # Ensure df output isn't truncated

category_counts = df['Category'].value_counts()
category_df = pd.DataFrame({'Category Name': category_counts.index, 'Category Count': category_counts.values})
category_df


Unnamed: 0,Category Name,Category Count
0,Enable Status,1280
1,Controlled Demand Speed,458
2,Digital Alarm,295
3,Comfort Temperature,221
4,System Override,190
5,Comfort Temperature Setpoint,178
6,Ventilation Supply Air Temperature Setpoint,135
7,Meter,112
8,Ventilation Supply Air Temperature,86
9,Filter Dirty Status,81


There is a large disparity between categories with the majority count 1,280 and minority count only 3.  This will need to be considered when assessing model performance.

### Create a Custom Tokeniser

As the 'Datapoint' column consists of shorthand text and acronyms it is not possible to use standard text representation methods.  Instead a custom tokeniser will be required to convert key text into a bag of words that can be used to train a model.


In [6]:
# Create custom tokenizer

def customTokeniser (working_df):
    
    datapoint_text = working_df['Datapoint'].str.lower().tolist() # Convert to Lower Case

    # Create Custom List of Common Acronyms to Split
    acronyms_to_split = ['ahu', 'ext', 'fan', 'hws', 'lthw', 'vt', 'pri', 'sec', 'destrat', 'pump', 'oat', 'sup', 'htg', 'coil',
                        'ctrl', 'pool', 'temp', 'vlv', 'flw', 'rtn', 'pir', 'chw', 'en', 'flow', 'flw', 'frost', 'frst', 'hum', 
                        'humidifier', 'sp', 'sup', 'rm', 'spc', 'space', 'flow', 'return', 'setback', 'aq', 'supply', 'extract', 
                        'chiller', 'boiler', 'blr', 'dirty', 'tef', 'eco', 'phe', 'phx', 'reheat', 'heat', 'cool', 'run', 'status',
                        'cal', 'calorifier', 'dx', 'room', 'cef', 'def', 'valve', 'mthw', 'dstrt', 'zone', 'ac', 'aricon', 'dhw', 
                        'mtr', 'meter', 'kwh', 'kw', 'm3', 'press', 'kg', 'import', 'export', 'generated', 'lv', 'gas', 'parasitic',
                        'utilised', 'rejected', 'steam', 'flt', 'fault', 'sts', 'filter', 'fltr', 'trip', 'hand', 'hi', 'high', 
                        'lo', 'low', 'fresh', 'frsh', 'air', 'pnl', 'spd', 'dmd', 'speed', 'demand', 'clean', 'ct', 'lwr', 'str', 
                        'upr', 'mid', 'mcw', 'lights', 'lux', 'ht', 'cut', 'off', 'ambient', 'fire', 'alarm', 'warning', 'undrflr',
                        'clg', 'reht', 'drty', 'hb', 'cmn', 'occ', 'nsb', 'setpoint', 'fume', 'ppm', 'rh', 'district', 'dsrtct',
                        'demand', 'pump', 'imm', 'open', 'on', 'off', 'enabled', 'disabled', 'water', 'electric', 'elec', 'gas',
                        'tz', 'fcu', 'vav', 'mxb', 'dmpr', 'damper', 'override', 'ovrd', 'hours', 'hrs', 'oride', 'fail', 'reset',
                        'lphw', 'hthw', 'hphw', 'pos', 'shunt', 'shnt', 'ok', 'prove', 'winter', 'summer', 'dp','displace', 'setp',
                        'intake', 'filter', 'fltr', 'sts', 'fa', 'speed', 'spd', 'hand', 'pir', 'bag', 'pnl', 'panel', 'not', 'auto',
                        'night', 'setback', 'stbk', 'tz', 'current', 'power', 'energy', 'pos', 'irradiance', 'healthy', 'wtr', 
                        'water', 'comfort']

    # Custom tokenizer to split words using the list of acronyms
    def custom_tokenizer(text):
        for acronym in acronyms_to_split:
            # Use regex to find the acronym in the text and split it into separate words
            text = re.sub(r'\b{}\b'.format(acronym), ' '.join(acronym), text)
        words = text.split()  # Split the text into separate words
        return words

    # Create the CountVectorizer object with the custom tokenizer
    vectorizer = CountVectorizer(tokenizer=custom_tokenizer)

    # Fit and transform the text data into a bag of words representation
    bag_of_words = vectorizer.fit_transform(datapoint_text)

    # Convert the BoW representation into a DataFrame
    bow_df = pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names_out())
    
    return bow_df


### Train and Asses an Initial Model

To gauge a rouhg understanding of potential model performance on the unbalanced dataset I've used a Random Forest and implemented a simple grid search to optimise model hyperparameters.

In [7]:
# Train an initial random forest

def train_RandomForest (df):
    
    X = customTokeniser(df)
    y = df['Category']


    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
    # NOTE: Stratify ensures at least 1 of each class per split
                
        
    # Define the hyperparameter grid to search over
    param_grid = {
        'n_estimators': [20, 40, 80, 100, 200],
        'max_depth': [None, 2, 5, 10, 20],
        'min_samples_split': [5, 10, 20, 30],
        'min_samples_leaf': [5, 10, 20, 30, 40, 50]
    }

    # Do grid search with cross-validation
    rf = RandomForestClassifier(random_state=1)
    cv_rf = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
    cv_rf.fit(X_train, y_train)

    # Get best hyperparameters
    best_params = cv_rf.best_params_

    # Train random forest model on entire training set using best hyperparameters
    rf = RandomForestClassifier(**best_params, random_state=1)
    rf.fit(X_train, y_train)
    return rf, X_train, X_test, y_train, y_test, best_params


In [8]:
# Make predictions on the test set - original unblanced dataset

rf, X_train, X_test, y_train, y_test, best_params = train_RandomForest(df)
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Get the classification report as a DataFrame
class_report_dict = classification_report(y_test, y_pred, target_names=y_test.unique(), output_dict=True)
class_report_df = pd.DataFrame(class_report_dict).transpose()

# Add the sample counts per category as a new column
category_sample_counts = y_test.value_counts().to_dict()
class_report_df['Sample Count'] = class_report_df.index.map(category_sample_counts)




Accuracy: 0.3099273607748184


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Oversample to Balance Dataset

The prediction performance is very poor for the first model.  This could either be due to the imbalanced dataset or perhaps the Random Forest not being a good fit for the problem.  To test, oversample to see if performance improves.  This could lead to overfitting of the minority categories.


In [9]:
# Dataset is very imbalanced, oversample minority categories and retrain

# Find the number of samples in the majority class
majority_count = df['Category'].value_counts().max()

# Create an empty DataFrame to store the oversampled data
oversampled_df = pd.DataFrame()

# Group the DataFrame by the 'Category' column
grouped = df.groupby('Category')

# Loop through each category and oversample to match the majority count
for category, group in grouped:
    if len(group) < majority_count:
        # Resample the current category to match the majority count
        oversampled_group = group.sample(n=majority_count, replace=True, random_state=42)
        oversampled_df = pd.concat([oversampled_df, oversampled_group])
    else:
        # If the category already has enough samples, keep it as it is
        oversampled_df = pd.concat([oversampled_df, group])


# Print to Confirm Categories Now Balanced
category_counts = oversampled_df['Category'].value_counts()
category_df = pd.DataFrame({'Category Name': category_counts.index, 'Category Count': category_counts.values})
category_df


Unnamed: 0,Category Name,Category Count
0,Ambient Temperature,1280
1,Pool Space Temperature Setback Setpoint,1280
2,Secondary Chw Return Temperature,1280
3,Secondary Chw Flow Temperature,1280
4,Secondary C T Heating Return Temperature,1280
5,Secondary C T Heating Flow Temperature,1280
6,Run Hours Totaliser,1280
7,Primary Heating Return Temperature,1280
8,Primary Heating Return Setpoint,1280
9,Primary Heating Flow Temperature,1280


In [10]:
# Retrain model with oversampled dataset

X = customTokeniser(oversampled_df)
y = oversampled_df['Category']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=50, max_features='sqrt', random_state=1)

# Train the Random Forest Classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.6254815924657534


The model accuracy appears much better, however this could be related to overfitting on the oversampled minority classes so visualise performance by category to check.

In [11]:
# Visualise model performance by category

class_report_dict = classification_report(y_test, y_pred, target_names=y_test.unique(), output_dict=True)
class_report_df = pd.DataFrame(class_report_dict).transpose()
class_report_df


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,precision,recall,f1-score,support
Secondary Chw Return Temperature,1.0,0.383104,0.553977,509.0
Controlled Demand Speed Setpoint,1.0,0.234818,0.380328,494.0
Controlled Demand Speed,1.0,1.0,1.0,518.0
OAT Eco Off Setpoint,1.0,0.709559,0.830108,544.0
DHW Return Temperature,1.0,1.0,1.0,511.0
Underfloor Heating Flow Temperature Setpoint,1.0,0.605578,0.754342,502.0
Secondary Vt Heating Return Setpoint,1.0,0.412229,0.583799,507.0
Ventilation Preheat Temperature,1.0,0.34004,0.507508,497.0
Lux Level Setpoint,1.0,1.0,1.0,546.0
Ambient Temperature,0.0,0.0,0.0,517.0


### Retrain Model on Majority Classes Only
Model performance is significantly affected by overfitting on the minority categories, as a test retrain the model on the majority categories only to see if performance improves.

In [12]:
# Try predicting on majority categories only

# Drop cateogies with < 50 Sample
category_counts = df['Category'].value_counts()
categories_to_drop = category_counts[category_counts < 50].index
df_majority = df[~df['Category'].isin(categories_to_drop)]

# Retrain and evaluate model
rf, X_train, X_test, y_train, y_test, best_params = train_RandomForest(df_majority)
y_pred = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.3775811209439528


### Create a Binary Vector Tokeniser
The performance of the Random Forest is still relatively poor.  Instead of oversampling the original dataset, or removing the minority categories, it may be more beneficial to build a dedicated training set by randomly selecting a fixed number of samples from each category.  The BoW representation won't be suitable for this as the training and test sets will need to be split prior to being tokenised.  This will result in a different number of vectors per dataset.  Instead I've created a binary vector tokeniser that will ensure the same number of vectors between the dedicated training dataset and the test dataset.


In [13]:
# Create new tokeniser to generate binary vectors based on whether each shorthand/acronym exists or not

def binary_vector_tokenizer (working_df):
    
    binary_vectors_list = []
    
    # Create Custom List of Common Acronyms to Split
    acronyms_to_split = ['ahu', 'ext', 'fan', 'hws', 'lthw', 'vt', 'pri', 'sec', 'destrat', 'pump', 'oat', 'sup', 'htg', 'coil',
                        'ctrl', 'pool', 'temp', 'vlv', 'flw', 'rtn', 'pir', 'chw', 'en', 'flow', 'flw', 'frost', 'frst', 'hum', 
                        'humidifier', 'sp', 'sup', 'rm', 'spc', 'space', 'flow', 'return', 'setback', 'aq', 'supply', 'extract', 
                        'chiller', 'boiler', 'blr', 'dirty', 'tef', 'eco', 'phe', 'phx', 'reheat', 'heat', 'cool', 'run', 'status',
                        'cal', 'calorifier', 'dx', 'room', 'cef', 'def', 'valve', 'mthw', 'dstrt', 'zone', 'ac', 'aricon', 'dhw', 
                        'mtr', 'meter', 'kwh', 'kw', 'm3', 'press', 'kg', 'import', 'export', 'generated', 'lv', 'gas', 'parasitic',
                        'utilised', 'rejected', 'steam', 'flt', 'fault', 'sts', 'filter', 'fltr', 'trip', 'hand', 'hi', 'high', 
                        'lo', 'low', 'fresh', 'frsh', 'air', 'pnl', 'spd', 'dmd', 'speed', 'demand', 'clean', 'ct', 'lwr', 'str', 
                        'upr', 'mid', 'mcw', 'lights', 'lux', 'ht', 'cut', 'off', 'ambient', 'fire', 'alarm', 'warning', 'undrflr',
                        'clg', 'reht', 'drty', 'hb', 'cmn', 'occ', 'nsb', 'setpoint', 'fume', 'ppm', 'rh', 'district', 'dsrtct',
                        'demand', 'pump', 'imm', 'open', 'on', 'off', 'enabled', 'disabled', 'water', 'electric', 'elec', 'gas',
                        'tz', 'fcu', 'vav', 'mxb', 'dmpr', 'damper', 'override', 'ovrd', 'hours', 'hrs', 'oride', 'fail', 'reset',
                        'lphw', 'hthw', 'hphw', 'pos', 'shunt', 'shnt', 'ok', 'prove', 'winter', 'summer', 'dp','displace', 'setp',
                        'intake', 'filter', 'fltr', 'sts', 'fa', 'speed', 'spd', 'hand', 'pir', 'bag', 'pnl', 'panel', 'not', 'auto',
                        'night', 'setback', 'stbk', 'tz', 'current', 'power', 'energy', 'pos', 'irradiance', 'healthy', 'wtr', 
                        'water', 'comfort']
    
    working_df['Datapoint'] = working_df['Datapoint'].str.lower() # Convert to Lower Case
    
    for sample in working_df['Datapoint']:
        binary_vectors = []
        
        for acronym in acronyms_to_split:
            is_present = int(acronym in sample)
            binary_vectors.append(is_present)
        
        binary_vectors_list.append(binary_vectors)
    
    return binary_vectors_list


# Get binary vectors for each sentence (acronym)
binary_vectors_list = binary_vector_tokenizer(df)

#print("Binary vectors for each sentence (acronym):")
#for i, vectors in enumerate(binary_vectors_list):
#    print(f"Sentence {i+1}: {vectors}")


In [14]:
# Retrain and evaluate using the binary vector tokeniser and a dedicated balanced dataset for model training

# Group by category
grouped = df.groupby('Category')

# Randomly sample 50 samples from each category
samples_per_category = 50
sampled_data = []

for category, group in grouped:
     if len(group) >= samples_per_category:
         sampled_group = group.sample(samples_per_category)
     else:
         # If a category has fewer than 50 samples, you can choose to either duplicate samples
         num_samples_to_duplicate = samples_per_category - len(group)
         duplicated_samples = group.sample(num_samples_to_duplicate, replace=True)
         sampled_group = pd.concat([group, duplicated_samples])
     sampled_data.append(sampled_group)

# Combine sampled data from all categories into df for training
balanced_training_df = pd.concat(sampled_data)

X_train = binary_vector_tokenizer(balanced_training_df)
y_train = balanced_training_df['Category']
X_test = binary_vector_tokenizer(df)
y_test = df['Category']

 
# Define the hyperparameter grid to search over
param_grid = {
     'n_estimators': [20, 40, 80, 100, 200],
     'max_depth': [None, 2, 5, 10, 20],
     'min_samples_split': [5, 10, 20, 30],
     'min_samples_leaf': [5, 10, 20, 30, 40, 50]
}

# Do grid search with cross-validation
rf = RandomForestClassifier(random_state=1)
cv_rf = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
cv_rf.fit(X_train, y_train)

# Get best hyperparameters
best_params = cv_rf.best_params_

# Train random forest model on entire training set using best hyperparameters
rf = RandomForestClassifier(**best_params, random_state=1)
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Get the classification report as a DataFrame
class_report_dict = classification_report(y_test, y_pred, target_names=y_test.unique(), output_dict=True)
class_report_df = pd.DataFrame(class_report_dict).transpose()

# Add the sample counts per category as a new column
category_sample_counts = y_test.value_counts().to_dict()
class_report_df['Sample Count'] = class_report_df.index.map(category_sample_counts)


Accuracy: 0.6493097602325019


In [15]:
# Visualise model performance by category

class_report_df = class_report_df.sort_values(by='Sample Count', ascending=False)
class_report_df


Unnamed: 0,precision,recall,f1-score,support,Sample Count
Enable Status,0.93617,0.6875,0.792793,64.0,1280.0
Controlled Demand Speed,0.488889,1.0,0.656716,22.0,458.0
Digital Alarm,1.0,1.0,1.0,8.0,295.0
Comfort Temperature,0.653061,1.0,0.790123,32.0,221.0
System Override,0.741379,0.767857,0.754386,112.0,190.0
Comfort Temperature Setpoint,0.131579,1.0,0.232558,5.0,178.0
Ventilation Supply Air Temperature Setpoint,0.243056,0.921053,0.384615,38.0,135.0
Meter,0.666667,1.0,0.8,4.0,112.0
Ventilation Supply Air Temperature,0.964384,0.768559,0.855407,458.0,86.0
Filter Dirty Status,0.35,1.0,0.518519,7.0,81.0


### Model with an SVM and Compare Performance
Using a binary vector tokeniser and dedicated balanced dataset to train the model does provide better results.  Whilst the overall acuracy is still relatively low at 66%, and there is still some overfitting on the smallest minority categories, the predictions across categories are generally more realistic.  The lower accuracy is likely now an aspect of using a Random Forest with sparse binary vectors, to check this I've created an SVM model below as these are generally better at handing sparse binary vectors.  Again, I've used a simple grid search to tune model hyperparameters.


In [16]:
# Train SVM model using the binary vector tokeniser and dedicated balanced dataset for model training

# Group by category
grouped = df.groupby('Category')

# Randomly sample 50 samples from each category
samples_per_category = 50
sampled_data = []

for category, group in grouped:
     if len(group) >= samples_per_category:
         sampled_group = group.sample(samples_per_category)
     else:
         # If a category has fewer than 20 samples, you can choose to either duplicate samples
         num_samples_to_duplicate = samples_per_category - len(group)
         duplicated_samples = group.sample(num_samples_to_duplicate, replace=True)
         sampled_group = pd.concat([group, duplicated_samples])
     sampled_data.append(sampled_group)

# Combine sampled data from all categories
balanced_df = pd.concat(sampled_data)

X_train = binary_vector_tokenizer(balanced_df)
y_train = balanced_df['Category']
X_test = binary_vector_tokenizer(df)
y_test = df['Category']

 
# Create SVM model
svm_model = SVC()

# Set up the parameter grid for grid search
param_grid = {
    'C': [0.1, 1, 10],                # Regularization parameter values to try
    'kernel': ['linear', 'rbf'],      # Kernel types to try
}
    
    
# Create the GridSearchCV object
grid_search = GridSearchCV(svm_model, param_grid, cv=3, scoring='accuracy')

# Perform grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best SVM model with the optimal hyperparameters
best_svm_model = grid_search.best_estimator_

# Make predictions on the test data using the best model
y_pred = best_svm_model.predict(X_test)

# Evaluate the model
category_sample_counts = y_test.value_counts().to_dict()

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Get the classification report as a DataFrame
class_report_dict = classification_report(y_test, y_pred, target_names=y_test.unique(), output_dict=True)
class_report_df = pd.DataFrame(class_report_dict).transpose()

# Add the sample counts per category as a new column
category_sample_counts = y_test.value_counts().to_dict()
class_report_df['Sample Count'] = class_report_df.index.map(category_sample_counts)


Accuracy: 0.8936788568660693


In [17]:
# Visualise performance by category
class_report_df = class_report_df.sort_values(by='Sample Count', ascending=False)
class_report_df


Unnamed: 0,precision,recall,f1-score,support,Sample Count
Enable Status,0.859155,0.953125,0.903704,64.0,1280.0
Controlled Demand Speed,1.0,1.0,1.0,22.0,458.0
Digital Alarm,1.0,1.0,1.0,8.0,295.0
Comfort Temperature,0.864865,1.0,0.927536,32.0,221.0
System Override,0.849206,0.955357,0.89916,112.0,190.0
Comfort Temperature Setpoint,0.333333,1.0,0.5,5.0,178.0
Ventilation Supply Air Temperature Setpoint,0.486486,0.947368,0.642857,38.0,135.0
Meter,1.0,1.0,1.0,4.0,112.0
Ventilation Supply Air Temperature,0.960373,0.899563,0.928974,458.0,86.0
Filter Dirty Status,0.777778,1.0,0.875,7.0,81.0


### Summary and Recomendations
In summary, due to the significant imbalance between the different categories in the original dataset, training a model with simple train/text split or oversampling minority categories does not perform well.  Instead, a dedicated training set had to be created from a smaller yet representative sample from each category.  This combined with a custom dictionary and tokeniser to convert key text from the 'Datapoint' category and an SVM model achieved good results with overall accuracy of 90%.  However, there is still some evidence of overfitting on the smallest minority categories.
It is recommended that further data be exported from VitalView to provide a minimum of 100 samples for categories from which a more robust training set can be constructed.
