### **CS6501 - MACHINE LEARNING AND APPLICATIONS**
#### **NOTEBOOK-2: Feature Lookup**

**Description:**  
In this notebook we use a generative model (LLaMA 3.2 1B Instruct) to describe each column in the cleaned BER dataset. Each feature is rated EASY, MEDIUM, or HARD according to how readily it can be measured or acquired, giving a usability-driven perspective for downstream feature selection and modeling.

#### --- Library Imports ---

In [1]:
import pandas as pd
import torch
from transformers import pipeline
import json
import os
import time

#### --- Load The Cleaned Dataset ---

In [2]:
# Path to the cleaned dataset (forward slashes work on Windows, Linux and macOS)
file_path = "../dataset/BERPublicsearch_Cleaned.csv"
# Load the dataset
df = pd.read_csv(file_path)
print(f"Dataset loaded: {df.shape[0]} rows × {df.shape[1]} columns")

Dataset loaded: 80000 rows × 101 columns


#### --- Get Feature Names ---

In [3]:
feature_name_list = df.columns.tolist()
print("Columns in the cleaned dataset :\n")
print(feature_name_list)

Columns in the cleaned dataset :

['CountyName', 'DwellingTypeDescr', 'Year_of_Construction', 'TypeofRating', 'EnergyRating', 'BerRating', 'GroundFloorArea(sq m)', 'UValueWall', 'UValueRoof', 'UValueFloor', 'UValueWindow', 'UvalueDoor', 'WallArea', 'RoofArea', 'FloorArea', 'WindowArea', 'DoorArea', 'NoStoreys', 'MainSpaceHeatingFuel', 'MainWaterHeatingFuel', 'HSMainSystemEfficiency', 'TGDLEdition', 'MPCDERValue', 'HSEffAdjFactor', 'HSSupplHeatFraction', 'HSSupplSystemEff', 'WHMainSystemEff', 'WHEffAdjFactor', 'SupplSHFuel', 'SupplWHFuel', 'SHRenewableResources', 'WHRenewableResources', 'NoOfChimneys', 'NoOfOpenFlues', 'NoOfFansAndVents', 'DraftLobby', 'VentilationMethod', 'FanPowerManuDeclaredValue', 'StructureType', 'PercentageDraughtStripped', 'NoOfSidesSheltered', 'PermeabilityTest', 'PermeabilityTestResult', 'TempAdjustment', 'HeatSystemControlCat', 'HeatSystemResponseCat', 'NoCentralHeatingPumps', 'CHBoilerThermostatControlled', 'NoOilBoilerHeatingPumps', 'NoGasBoilerHeatingPumps'

In [4]:
# target and leakage columns (they won't be used in upcoming feature selection)
columns_to_remove = [
    'BerRating',      # Target
    'EnergyRating',   # Categorical BER rating (B1, B2, etc.) derived from BerRating (leakage)
]

In [5]:
# drop the target and leakage columns
feature_name_list = [col for col in feature_name_list if col not in columns_to_remove]
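
The list comprehension above silently ignores any entry in `columns_to_remove` that does not match a dataset column, so a misspelled name would leave the target in the feature list. A minimal sketch of a stricter variant follows; `drop_columns` is a hypothetical helper, and the toy column list stands in for `df.columns.tolist()`.

```python
def drop_columns(all_columns, to_remove):
    """Return all_columns without to_remove, failing loudly on unknown names."""
    missing = [col for col in to_remove if col not in all_columns]
    if missing:
        raise KeyError(f"columns not found in dataset: {missing}")
    removed = set(to_remove)
    return [col for col in all_columns if col not in removed]


# Toy stand-in for the real column list (assumption: only four columns shown).
toy_columns = ['CountyName', 'EnergyRating', 'BerRating', 'UValueWall']
toy_features = drop_columns(toy_columns, ['BerRating', 'EnergyRating'])
print(toy_features)
```

With this guard, a typo such as `'BERRating'` raises a `KeyError` instead of passing unnoticed.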

In [6]:
print("Features to generate the lookup -",len(feature_name_list)," :\n")
print(feature_name_list)

Features to generate the lookup - 99  :

['CountyName', 'DwellingTypeDescr', 'Year_of_Construction', 'TypeofRating', 'GroundFloorArea(sq m)', 'UValueWall', 'UValueRoof', 'UValueFloor', 'UValueWindow', 'UvalueDoor', 'WallArea', 'RoofArea', 'FloorArea', 'WindowArea', 'DoorArea', 'NoStoreys', 'MainSpaceHeatingFuel', 'MainWaterHeatingFuel', 'HSMainSystemEfficiency', 'TGDLEdition', 'MPCDERValue', 'HSEffAdjFactor', 'HSSupplHeatFraction', 'HSSupplSystemEff', 'WHMainSystemEff', 'WHEffAdjFactor', 'SupplSHFuel', 'SupplWHFuel', 'SHRenewableResources', 'WHRenewableResources', 'NoOfChimneys', 'NoOfOpenFlues', 'NoOfFansAndVents', 'DraftLobby', 'VentilationMethod', 'FanPowerManuDeclaredValue', 'StructureType', 'PercentageDraughtStripped', 'NoOfSidesSheltered', 'PermeabilityTest', 'PermeabilityTestResult', 'TempAdjustment', 'HeatSystemControlCat', 'HeatSystemResponseCat', 'NoCentralHeatingPumps', 'CHBoilerThermostatControlled', 'NoOilBoilerHeatingPumps', 'NoGasBoilerHeatingPumps', 'DistributionLosses'

#### --- Generate Feature Metadata ---

In [7]:
system_prompt = """You are an advanced AI assistant that explains the meaning of a single feature from the Ireland BER (Building Energy Rating) dataset.

The user will provide one column name.  
Your job is to output a strict JSON object with three keys:

{
  "category": "one of: Geometry, Envelope, Heating, Ventilation, Energy",
  "availability": "one of: easy, medium, hard",
  "description": "one-line plain-English meaning of the feature"
}

CATEGORY RULES:
- Geometry : size, area, dimension and layout related features
- Envelope : insulation and material related features
- Heating : heating systems and associated components
- Ventilation : ventilation and air circulation related features
- Energy : energy and related metadata

AVAILABILITY RULES:
- easy : the feature is easy to acquire from the building with general measurement tools or knowledge
- medium : only a trained assessor can measure it, using advanced tools
- hard : derived or calculated from many parameters

Always return **valid JSON only**. No extra text."""
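
The prompt fixes the allowed values for `category` and `availability`, but a small model may still return something outside them. The sketch below checks a reply against that schema before it is stored. `validate_feature_metadata` and the constant names are hypothetical, and the example reply is hand-written rather than actual model output.

```python
import json

# Allowed values copied from the system prompt above.
ALLOWED = {
    "category": {"Geometry", "Envelope", "Heating", "Ventilation", "Energy"},
    "availability": {"easy", "medium", "hard"},
}


def validate_feature_metadata(raw_text):
    """Parse raw_text as JSON and check it against the prompt's schema.

    Returns the parsed dict, or raises ValueError describing the first problem.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = {"category", "availability", "description"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key, allowed in ALLOWED.items():
        value = data[key]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"unexpected {key}: {value!r}")
    description = data["description"]
    if not isinstance(description, str) or not description.strip():
        raise ValueError("description must be a non-empty string")
    return data


# Hand-written example reply (assumption), not real model output.
example_reply = (
    '{"category": "Envelope", "availability": "medium", '
    '"description": "Thermal transmittance of the external walls."}'
)
print(validate_feature_metadata(example_reply)["availability"])
```

A check like this could run inside the retry loop, so that a reply with an unknown category is retried instead of written to the CSV.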


In [11]:
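model_id = "meta-llama/Llama-3.2-1B-Instruct"
# Llama 3.2 is a gated model: read the Hugging Face access token from the
# HF_TOKEN environment variable instead of hard-coding it in the notebook.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    temperature=0.1,  # low temperature keeps the JSON output close to deterministic
    top_p=0.9,
    token=os.environ.get("HF_TOKEN")
)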

'(ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)), '(Request ID: d480059f-a029-4630-ad57-e661652ceba5)')' thrown while requesting HEAD https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/resolve/main/config.json
Retrying in 1s [Retry 1/5].

KeyboardInterrupt



In [9]:
# Resume from an existing metadata CSV so already-processed features are skipped.
# Note: this rebinds `df` from the BER dataset to the feature-metadata table.
csv_file = "../dataset/BER_feature_metadata.csv"
if os.path.exists(csv_file):
    df = pd.read_csv(csv_file,encoding='latin1')
    processed_features = set(df['feature_name'].tolist())
else:
    df = pd.DataFrame()
    processed_features = set()

for feature_name in feature_name_list:
    if feature_name in processed_features:
        print(f"skipping already processed feature: {feature_name}")
        continue
    print("generating metadata of:", feature_name)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"explain the feature: {feature_name}"}
    ]
    retries = 0
    success = False
    feature_metadata = {}
    # Allow up to 3 attempts, since the model does not always return valid JSON.
    while retries < 3 and not success:
        try:
            # With chat-style input, generated_text holds the whole conversation;
            # the last message is the model's reply.
            output = pipe(messages, max_new_tokens=500, pad_token_id=0)[0]["generated_text"]
            json_string = output[-1]['content']
            feature_metadata = json.loads(json_string)
            # A JSON list or scalar cannot take the feature_name key below.
            if not isinstance(feature_metadata, dict):
                raise TypeError("expected a JSON object")
            success = True
        except (json.JSONDecodeError, IndexError, KeyError, TypeError) as e:
            retries += 1
            print(f"failed to parse json for {feature_name}, attempt {retries}/3")
            if retries < 3:
                print("retrying...")
            else:
                print("skipping this feature due to repeated json parse errors")
                feature_metadata = {}
    
    if success:
        feature_metadata['feature_name'] = feature_name
        df = pd.concat([df, pd.DataFrame([feature_metadata])], ignore_index=True)
        # Save after every feature so an interrupted run can resume.
        df.to_csv(csv_file, index=False)
    else:
        print(f"{feature_name} not added to CSV due to parsing failure")
        
    # Brief pause between generations
    time.sleep(3)

skipping already processed feature: CountyName
skipping already processed feature: DwellingTypeDescr
skipping already processed feature: Year_of_Construction
skipping already processed feature: TypeofRating
skipping already processed feature: GroundFloorArea(sq m)
skipping already processed feature: UValueWall
skipping already processed feature: UValueRoof
skipping already processed feature: UValueFloor
skipping already processed feature: UValueWindow
skipping already processed feature: UvalueDoor
skipping already processed feature: WallArea
skipping already processed feature: RoofArea
skipping already processed feature: FloorArea
skipping already processed feature: WindowArea
skipping already processed feature: DoorArea
skipping already processed feature: NoStoreys
skipping already processed feature: MainSpaceHeatingFuel
skipping already processed feature: MainWaterHeatingFuel
skipping already processed feature: HSMainSystemEfficiency
skipping already processed feature: TGDLEdition
sk
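
The loop above passes the model's reply straight to `json.loads`, which fails whenever the reply contains text around the JSON object. A minimal sketch of a more tolerant parser follows, assuming the reply contains at most one well-formed object. `extract_json_object` is a hypothetical helper, and the example reply is hand-written rather than actual model output.

```python
import json


def extract_json_object(text):
    """Return the first JSON object found in text, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model reply")


# Hand-written reply with extra prose around the JSON (assumption).
wrapped_reply = (
    'Here is the JSON:\n'
    '{"category": "Geometry", "availability": "easy", '
    '"description": "Total wall area."}\n'
    'Let me know if you need anything else.'
)
print(extract_json_object(wrapped_reply)["category"])
```

Swapping this in for the `json.loads` call may reduce the number of retries, at the cost of accepting replies that break the "valid JSON only" instruction.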

In [10]:
# check if all features generated
remaining_features = [f for f in feature_name_list if f not in df['feature_name'].tolist()]
print("number of features left:", len(remaining_features))

number of features left: 0
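
Once no features remain, the lookup can be grouped by availability for the downstream feature selection mentioned in the description. The sketch below uses toy rows in place of `BER_feature_metadata.csv`; the labels are illustrative assumptions, not the model's actual output, and `group_by_availability` is a hypothetical helper.

```python
from collections import defaultdict

# Toy rows standing in for the metadata CSV (labels are illustrative only).
toy_rows = [
    {"feature_name": "CountyName", "availability": "easy"},
    {"feature_name": "UValueWall", "availability": "Medium "},
    {"feature_name": "MPCDERValue", "availability": "hard"},
    {"feature_name": "NoStoreys", "availability": "easy"},
]


def group_by_availability(rows):
    """Map each normalized availability label to its feature names."""
    groups = defaultdict(list)
    for row in rows:
        # Normalize case and whitespace, since the model's labels may vary.
        label = str(row.get("availability", "")).strip().lower()
        groups[label].append(row["feature_name"])
    return dict(groups)


groups = group_by_availability(toy_rows)
for label, names in groups.items():
    print(label, len(names), names)
```

With the real table, the same function could be fed `df.to_dict("records")`.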
