# Running Llama 2 on CS Cluster machines for increased speed (and accuracy)
Use this notebook to run Llama 2 on the CS Cluster or locally.

## Step 0: Make sure the NVIDIA RTX GPU is available.

In [3]:
# check for nvidia CUDA
!nvidia-smi

Tue Feb 27 01:27:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090 Ti     On  | 00000000:01:00.0 Off |                  Off |
|  0%   46C    P8              13W / 450W |     28MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
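
The same check can be made from Python before the model is loaded. The sketch below assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; the helper names `parse_gpu_query` and `query_gpus` are hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def parse_gpu_query(csv_text):
    """Parse 'name, memory.total, memory.used' CSV rows (no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, total_mib, used_mib = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "total_mib": int(total_mib), "used_mib": int(used_mib)})
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_query(result.stdout)


# a sample row in the format of the query above, using the values from the table
sample = "NVIDIA GeForce RTX 3090 Ti, 24564, 28\n"
print(parse_gpu_query(sample))
```

An empty list from `query_gpus()` would suggest that the later `gpu_layers` setting has no GPU to offload to.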
                                                                    

## Step 1: Assuming that folder "models" exists, set up Llama2 using ctransformers
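This version of Llama 2 13B-chat is quantised to 5 bits (q5_K_M). Quantisation does not improve accuracy; it shrinks the weights so that the model fits on a single GPU, at a small cost in accuracy compared with the full-precision model.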
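
As a rough illustration of why quantisation matters for memory, the sketch below estimates the size of the weights alone at two precisions. It assumes a nominal 13 billion parameters and a flat number of bits per weight; real q5_K_M files mix precisions and carry metadata, so they are somewhat larger than the 5-bit figure.

```python
def approx_weight_size_gib(n_params, bits_per_weight):
    """Nominal size of the weights alone, in GiB (ignores metadata and activations)."""
    return n_params * bits_per_weight / 8 / 2**30


n_params = 13e9  # assumption: nominal parameter count of a 13B model

for bits in (16, 5):
    print(f"{bits}-bit weights: ~{approx_weight_size_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights come to roughly 24 GiB, about the whole memory of the card shown in Step 0, while the 5-bit weights need under 8 GiB.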

In [4]:
# load ctransformers and set up parameters
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    model_path_or_repo_id="./models/13B-chat-GGUF-q5_K_M.gguf",
    model_file="13B-chat-GGUF-q5_K_M.gguf",
    model_type="llama",
    max_new_tokens=75,
    repetition_penalty=1.2,
    temperature=0.25,
    top_p=0.95,
    top_k=150,
    threads=22,
    batch_size=40,
    gpu_layers=50,
)
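
Because this step assumes the model file is already in place, it can help to fail early with a clear message when it is not. This is a minimal sketch using only the standard library; `check_model_path` is a hypothetical helper, not part of ctransformers.

```python
from pathlib import Path


def check_model_path(model_path):
    """Return the path as a Path object, or raise a clear error if the file is missing."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path
```

Calling `check_model_path("./models/13B-chat-GGUF-q5_K_M.gguf")` before `from_pretrained` would surface a missing download before any loading is attempted.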

## Step 2: Load the Excel spreadsheet into a DataFrame.

In [5]:
import pandas as pd
import os
import re

# specify the path to the Excel file
excel_file_path = '../data/hazard_definitions.xlsx'

# read the Excel file into a pandas DataFrame
df = pd.read_excel(excel_file_path)



In [6]:
df

Unnamed: 0,Hazard_Code,Hazard_Category,Hazard_Subcategory,Hazard_Name,Hazard_Description,Upstream_Hazards,Excluded_Hazards,Synonyms,Keywords,Keywords_Operator,Questions
0,MH0001,METEOROLOGICAL AND HYDROLOGICAL,Convective-Related,Downburst,Downburst A downburst is a violent and damagin...,,,"Microburst, Macroburst, Wind Sear",Wind,OR,Did a downburst occur?
1,MH0002,METEOROLOGICAL AND HYDROLOGICAL,Convective-Related,Lightning (Electrical Storm),Lightning Lightning is the luminous manifestat...,,,"Bolt, Thunderbolt, Bolt-from-the-blue, Firebol...","Lightning, storm",OR,Was there lightning?
2,MH0003,METEOROLOGICAL AND HYDROLOGICAL,Convective-Related,Thunderstorm,Thunderstorm A thunderstorm is defined as one ...,MH0002,,,storm,OR,Was there a thunderstorm?
3,MH0004,METEOROLOGICAL AND HYDROLOGICAL,Flood,Coastal Flood,Coastal Flood Coastal flooding is most frequen...,,,"Storm Surge, Coastal inundation",Flood,OR,Did the even take place at the coast?
4,MH0005,METEOROLOGICAL AND HYDROLOGICAL,Flood,Estuarine (Coastal) Flood,Estuarine Flood Estuarine flooding is flooding...,MH0004,,"Flood, Flooding, Coastal flooding",river,OR,Was the river flooding caused by coastal flood...
...,...,...,...,...,...,...,...,...,...,...,...
297,SO0004,SOCIETAL,Post-Conflict,Explosive Remnants of War,Explosive Remnants of War Explosive remnants o...,SO0003,,"Unexploded ordnance, Abandoned explosive ordnance","explosive, crater, mines",AND,Were there any reports of explosive remnants o...
298,SO0005,SOCIETAL,Post-Conflict,Environmental Degradation from Conflict,Environmental Degradation from Conflict Enviro...,SO0003,,"Ecological degradation, Environmental damage","pollution, war",AND,Were there any reports of environmental degrad...
299,SO0006,SOCIETAL,Behavioural,Violence,Violence Violence refers to the intentional or...,SO0003,,,violence,OR,Were there any reports of violence?
300,SO0007,SOCIETAL,Behavioural,Stampede or Crushing (Human),Stampede or Crushing Stampede or crushing is t...,SO0003,,"Crush, Mass panic, Crowd disaster","stampede, crushing",OR,Were there any reports of stampede or crushing?


## Step 3: Define a helper function to extract the score from a response.

In [7]:
def extract_score(response):
    """
    Extracts the likelihood score from the model's response.

    Parameters:
    response (str): The model's response.

    Returns:
    float: The extracted likelihood score if found, -1 otherwise.
    """
    # match the first standalone digit from 0 to 5, optionally wrapped in parentheses;
    # the digit is always captured in group 1
    likelihood_match = re.search(r'\(?\b([0-5])\b\)?', response)

    # check if a match is found and print the extracted number
    if likelihood_match:
        extracted_likelihood = float(likelihood_match.group(1))
        print("Extracted Likelihood:", extracted_likelihood)
        return extracted_likelihood
    else:
        print("No likelihood score found.")
        return -1
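
The first standalone digit in a response is not always the score: a reply that begins "Of the 2 hazards..." would be scored 2 whatever rating follows. The sketch below is a stricter variant, assuming the model echoes the scale label next to the number (as in "3 - Likely" or "4 (Very Likely)"). `extract_score_labelled` is a hypothetical name, and the fallback keeps the first-digit behaviour.

```python
import re


def extract_score_labelled(response):
    """Prefer a digit followed by a scale label; fall back to the first standalone digit."""
    labelled = re.search(
        r'\b([0-5])\b\s*(?:-|\()\s*(?:almost|very|likely|unlikely)',
        response,
        re.IGNORECASE,
    )
    if labelled:
        return float(labelled.group(1))
    # fallback: first standalone digit between 0 and 5
    fallback = re.search(r'\b([0-5])\b', response)
    return float(fallback.group(1)) if fallback else -1


print(extract_score_labelled("Of the 2 hazards, I rate this 5 - Almost always."))
```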

## Step 4: Create ~303^2 ordered hazard pairs to feed to the LLM.

In [8]:
# take combinations
from itertools import product

# extract the unique hazard names from the desired column
unique_values = df['Hazard_Name'].unique()  # slice this array to build a smaller A x A matrix; leave as-is to use every hazard
print(unique_values)

# create ordered pairs (both directions), excluding pairs of a hazard with itself
pairs = [(x, y) for x, y in product(unique_values, unique_values) if x != y]

['Downburst' 'Lightning (Electrical Storm)' 'Thunderstorm' 'Coastal Flood'
 'Estuarine (Coastal) Flood' 'Flash Flood' 'Fluvial (Riverine) Flood'
 'Groundwater Flood' 'Ice-Jam Flood Including Debris'
 'Ponding (Drainage) Flood' 'Snowmelt Flood' 'Surface Water Flooding'
 'Glacial Lake Outburst Flood' 'Black Carbon (Brown Clouds)'
 'Dust storm or Sandstorm' 'Fog' 'Haze' 'Polluted Air' 'Sand haze' 'Smoke'
 'Ocean Acidification' 'Rogue Wave' 'Sea Water Intrusion'
 'Sea Ice (Ice Bergs)' 'Ice Flow' 'Seiche' 'Storm Surge' 'Storm Tides'
 'Tsunami' 'Depression or Cyclone (Low Pressure Area)'
 'Extra-tropical Cyclone' 'Sub-Tropical Cyclone' 'Acid Rain' 'Blizzard'
 'Drought' 'Hail' 'Ice Storm' 'Snow' 'Snow Storm' 'Cold Wave' 'Dzud'
 'Freeze' 'Frost (Hoar Frost)' 'Freezing Rain (Supercooled Rain)' 'Glaze'
 'Ground Frost' 'Heatwave' 'Icing (Including Ice)' 'Thaw' 'Avalanche'
 'Mud Flow' 'Rock slide' 'Derecho' 'Gale (Strong Gale)' 'Squall'
 'Subtropical Storm' 'Tropical Cyclone (Cyclonic Wind/ Rain S
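
Filtering `product` to drop self-pairs yields every ordered pair of distinct hazards, so n unique names give n × (n − 1) pairs. The sketch below checks this on three names taken from the list above and shows that, for unique names, `itertools.permutations` produces the same pairs in the same order.

```python
from itertools import permutations, product

names = ["Downburst", "Thunderstorm", "Coastal Flood"]

# the approach used in the notebook
via_product = [(x, y) for x, y in product(names, names) if x != y]

# equivalent when every name is unique
via_permutations = list(permutations(names, 2))

n = len(names)
print(len(via_product), n * (n - 1))
print(via_product == via_permutations)
```

If there are 303 unique hazards, as the step title suggests, this comes to 303 × 302 = 91,506 LLM calls.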

## Step 5: Create new dataframes to store new data.

In [9]:
# Check if the Excel files already exist
if os.path.exists("./out/scores.xlsx") and os.path.exists("./out/justifications.xlsx"):
    # Load the existing Excel files
    scores_df = pd.read_excel("./out/scores.xlsx", index_col=0, dtype=str)
    justification_df = pd.read_excel("./out/justifications.xlsx", index_col=0, dtype=str)
else:
    # Create empty DataFrames; they are written to Excel in Step 7
    os.makedirs("./out", exist_ok=True)
    scores_df = pd.DataFrame(index=unique_values, columns=unique_values)
    justification_df = pd.DataFrame(index=unique_values, columns=unique_values)

scores_df

Unnamed: 0,Downburst,Lightning (Electrical Storm),Thunderstorm,Coastal Flood,Estuarine (Coastal) Flood,Flash Flood,Fluvial (Riverine) Flood,Groundwater Flood,Ice-Jam Flood Including Debris,Ponding (Drainage) Flood,...,Road Traffic Accident,Explosive agents,International Armed Conflict (IAC),Non-International Armed Conflict (NIAC),Civil Unrest,Explosive Remnants of War,Environmental Degradation from Conflict,Violence,Stampede or Crushing (Human),Financial shock
Downburst,,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,,,,,,,,,,
Lightning (Electrical Storm),,,,,,,,,,,...,,,,,,,,,,
Thunderstorm,,,,,,,,,,,...,,,,,,,,,,
Coastal Flood,,,,,,,,,,,...,,,,,,,,,,
Estuarine (Coastal) Flood,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Explosive Remnants of War,,,,,,,,,,,...,,,,,,,,,,
Environmental Degradation from Conflict,,,,,,,,,,,...,,,,,,,,,,
Violence,,,,,,,,,,,...,,,,,,,,,,
Stampede or Crushing (Human),,,,,,,,,,,...,,,,,,,,,,


### Step 5.5: Find the first pair which does not have a value stored

In [10]:
def find_first_missing_pair(df):
    '''
    Finds the first pair of row and column categories in a DataFrame where the corresponding element is missing.

    Parameters:
    - df (pd.DataFrame): The DataFrame to search for missing values.

    Returns:
    - tuple or None: If a missing value is found, returns a tuple containing the row and column categories of the missing element. If no missing values are found, returns None.
    '''
    categories = df.columns  # assumes the row index lists the same categories in the same order
    for i, row_category in enumerate(categories):
        for j, col_category in enumerate(categories):
            if i != j:  # skip the diagonal, where both elements are the same
                value = df.iloc[i, j]  # positional lookup; the row headers are the index, so no offset is needed
                if pd.isna(value):  # Check if the value is NaN
                    print(i, j)
                    return (row_category, col_category)
    return None

missing_value = find_first_missing_pair(scores_df)
missing_value

0 22


('Downburst', 'Sea Water Intrusion')
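
The resume logic can also be written without positional indexing. This is a minimal sketch, assuming completed scores are kept in a dict keyed by `(cause, effect)`; `first_unscored` and the sample data are hypothetical.

```python
def first_unscored(pairs, scores):
    """Return the first pair with no stored score, or None if every pair is scored."""
    return next((pair for pair in pairs if scores.get(pair) is None), None)


pairs = [("Downburst", "Fog"), ("Downburst", "Haze"), ("Fog", "Downburst")]
scores = {("Downburst", "Fog"): 2.0}  # made-up score, for illustration only

start = first_unscored(pairs, scores)
# guard against None so a fully scored matrix resumes with an empty list
remaining = pairs[pairs.index(start):] if start is not None else []
print(start, len(remaining))
```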

## Step 6: Run the LLM.

In [11]:
def run_llm(hazard1, hazard2, def1, def2):
    '''
    Run a likelihood assessment using a Language Model (LLM) to evaluate the likelihood that a given hazard1 causes hazard2.

    Parameters:
    - hazard1 (str): The first hazard in the assessment.
    - hazard2 (str): The second hazard in the assessment.
    - def1 (str): The definition of the first hazard.
    - def2 (str): The definition of the second hazard.

    Returns:
    tuple: A tuple containing the numerical score representing the likelihood assessment (ranging from 0 to 5) and the detailed response from the language model.
    '''
    # cut definitions to be only first sentence
    def1 = def1.split(".")[0]
    def2 = def2.split(".")[0]
    
    # begin prompting
    prompt = f"""What is the likelihood that {hazard1} causes {hazard2}, bearing in mind:
    {hazard1}: {def1}
    {hazard2}: {def2}
    """
    
    super_prompt = f"""
    SYSTEM: We're evaluating the likelihood of various hazards causing specific outcomes. Your responses should be one number between 0 and 5, following the below scale. Include a short explanation for your score, as it helps understand the reasoning behind your assessment.
    
    - 0: Almost never
    - 1: Very Unlikely
    - 2: Unlikely
    - 3: Likely
    - 4: Very likely
    - 5: Almost always

    Given the above, consider the following query:

    USER: {prompt}
    
    ASSISTANT:
    """
    
    response = llm(super_prompt)
    print(f"Running: {hazard1}, {hazard2}")
    print(response)
    score = extract_score(response)
    print("====================================")
    return (score, response)
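
Trimming each definition with `split(".")[0]` keeps the prompt short, but it also cuts at the first full stop of any kind, including the ones in abbreviations. The sketch below isolates the prompt-building step so that this can be checked without loading the model; `first_sentence` and `build_prompt` are hypothetical helpers, and the definitions are made up for illustration.

```python
def first_sentence(text):
    """Keep everything before the first full stop, as run_llm does."""
    return text.split(".")[0]


def build_prompt(hazard1, hazard2, def1, def2):
    """Assemble the user query from two hazard names and their trimmed definitions."""
    return (
        f"What is the likelihood that {hazard1} causes {hazard2}, bearing in mind:\n"
        f"{hazard1}: {first_sentence(def1)}\n"
        f"{hazard2}: {first_sentence(def2)}\n"
    )


# made-up definitions, for illustration only
fog_def = "Fog is a cloud at ground level. It cuts visibility."
haze_def = "Haze is dry particles in air, e.g. dust."
print(build_prompt("Fog", "Haze", fog_def, haze_def))
```

The second definition is cut inside "e.g.", which shows the kind of truncation to watch for in the real hazard descriptions.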
    

In [12]:
pairs

[('Downburst', 'Lightning (Electrical Storm)'),
 ('Downburst', 'Thunderstorm'),
 ('Downburst', 'Coastal Flood'),
 ('Downburst', 'Estuarine (Coastal) Flood'),
 ('Downburst', 'Flash Flood'),
 ('Downburst', 'Fluvial (Riverine) Flood'),
 ('Downburst', 'Groundwater Flood'),
 ('Downburst', 'Ice-Jam Flood Including Debris'),
 ('Downburst', 'Ponding (Drainage) Flood'),
 ('Downburst', 'Snowmelt Flood'),
 ('Downburst', 'Surface Water Flooding'),
 ('Downburst', 'Glacial Lake Outburst Flood'),
 ('Downburst', 'Black Carbon (Brown Clouds)'),
 ('Downburst', 'Dust storm or Sandstorm'),
 ('Downburst', 'Fog'),
 ('Downburst', 'Haze'),
 ('Downburst', 'Polluted Air'),
 ('Downburst', 'Sand haze'),
 ('Downburst', 'Smoke'),
 ('Downburst', 'Ocean Acidification'),
 ('Downburst', 'Rogue Wave'),
 ('Downburst', 'Sea Water Intrusion'),
 ('Downburst', 'Sea Ice (Ice Bergs)'),
 ('Downburst', 'Ice Flow'),
 ('Downburst', 'Seiche'),
 ('Downburst', 'Storm Surge'),
 ('Downburst', 'Storm Tides'),
 ('Downburst', 'Tsunami'),


## Step 7: Record and update spreadsheets.

In [13]:
# store the scores and justifications in the new dataframes
# missing_value is the first pair without a stored value; resume the pairs list from there
pairs = pairs[pairs.index(missing_value):] if missing_value is not None else []

# look up each hazard's description once (first occurrence per name)
descriptions = df.drop_duplicates(subset='Hazard_Name').set_index('Hazard_Name')['Hazard_Description']

for i, (hazard1, hazard2) in enumerate(pairs):
    score, response = run_llm(hazard1, hazard2, descriptions[hazard1], descriptions[hazard2])
    scores_df.loc[hazard1, hazard2] = score
    justification_df.loc[hazard1, hazard2] = response
    # checkpoint every 20 pairs
    if i % 20 == 0:
        scores_df.to_excel("./out/scores.xlsx")
        justification_df.to_excel("./out/justifications.xlsx")

# save whatever was scored after the last checkpoint
scores_df.to_excel("./out/scores.xlsx")
justification_df.to_excel("./out/justifications.xlsx")

Running: Downburst, Sea Water Intrusion

    Based on my training data, I would rate the likelihood of Downburst causing Sea Water Intrusion as a 3 - Likely. This is because downbursts can cause significant damage to coastal infrastructure and disrupt natural barriers that prevent seawater intrusion into freshwater aquifers. The strong winds and heavy
Extracted Likelihood: 3.0
Running: Downburst, Sea Ice (Ice Bergs)

    Based on my training data and knowledge, I would rate the likelihood of Downburst causing Sea Ice (Ice Bergs) as a 2 - Unlikely. This is because downbursts are typically associated with severe thunderstorms that occur over land or near coastal areas, rather than in open ocean waters where sea ice forms.
Extracted Likelihood: 2.0
Running: Downburst, Ice Flow

    Based on my training data, I would rate the likelihood of Downburst causing Ice Flow as a 2 - Unlikely. This is because downbursts are typically associated with warm air and heavy precipitation, which does not 

KeyboardInterrupt: 
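
An interrupt like the one above discards every result scored since the last checkpoint. One way to limit the loss is to save in a `finally` block, sketched below with a generic `save` callback standing in for the two `to_excel` calls; `run_with_checkpoints` is a hypothetical helper.

```python
def run_with_checkpoints(items, process, save, every=20):
    """Process items in order, saving every `every` items and once more on exit."""
    results = {}
    try:
        for n, item in enumerate(items, start=1):
            results[item] = process(item)
            if n % every == 0:
                save(results)
    finally:
        # runs on normal completion and on KeyboardInterrupt alike
        save(results)
    return results


saved_sizes = []
out = run_with_checkpoints(
    list(range(45)),
    process=lambda x: x * 2,                       # stand-in for run_llm
    save=lambda r: saved_sizes.append(len(r)),     # stand-in for writing the spreadsheets
)
print(saved_sizes)
```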

## Step 8: Rerun every pair scored -1 until none remain

In [14]:
def find_invalid_pairs():
    '''
    Finds pairs of row and column categories in the saved scores spreadsheet where the corresponding element is -1.

    Returns:
    - list: A list of tuples containing the row and column categories of the invalid elements.
    '''
    scores_df = pd.read_excel("./out/scores.xlsx", index_col=0, dtype=str)
    
    # check which pairs are equal to -1
    invalid_pairs = []
    for i, row in scores_df.iterrows():
        for j, value in row.items():
            if value == "-1":
                invalid_pairs.append((i, j))
    
    return invalid_pairs

In [15]:
# rerun the invalid pairs and update scores_df
invalid_pairs = find_invalid_pairs()

# until no more invalid pairs
while invalid_pairs:
    for j, pair in enumerate(invalid_pairs):
        data = run_llm(pair[0], pair[1], df[df['Hazard_Name'] == pair[0]]['Hazard_Description'].values[0], df[df['Hazard_Name'] == pair[1]]['Hazard_Description'].values[0])
        scores_df.loc[pair[0], pair[1]] = data[0]
        justification_df.loc[pair[0], pair[1]] = data[1]
        # checkpoint every 20 pairs
        if j % 20 == 0:
            scores_df.to_excel("./out/scores.xlsx")
            justification_df.to_excel("./out/justifications.xlsx")
    # save the whole pass before re-reading the file; otherwise find_invalid_pairs sees stale -1 values
    scores_df.to_excel("./out/scores.xlsx")
    justification_df.to_excel("./out/justifications.xlsx")
    # check again
    invalid_pairs = find_invalid_pairs()

Running: Downburst, Lightning (Electrical Storm)

    Based on my training data and knowledge, I would rate the likelihood of Downburst causing Lightning (Electrical Storm) as 4 - Very Likely. This is because downbursts are often associated with severe thunderstorms, which can produce lightning. In fact, lightning is one of the primary dangers pos
Extracted Likelihood: 4.0
Running: Downburst, Thunderstorm

    Based on the information provided, I would rate the likelihood that Downburst causes Thunderstorm as 4 (Very Likely). This is because downbursts are often associated with severe thunderstorms, and the violent downdrafts can lead to the formation of electrical discharges. The strong winds and
Extracted Likelihood: 4.0
Running: Downburst, Coastal Flood

    Based on my training data, I would rate the likelihood of Downburst causing Coastal Flood as a 3 - Likely. This is because downbursts can generate strong surface winds that could potentially cause coastal flooding, especially if
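
A loop that repeats until no -1 remains never ends if the model keeps answering some pair without a number. The sketch below caps the number of passes and reports which pairs are still unresolved. `rescore_invalid` is a hypothetical helper, and the scorer in the usage example is a stand-in for `run_llm`.

```python
def rescore_invalid(pairs, score_fn, max_rounds=3):
    """Rescore pairs until each has a valid score or max_rounds passes have run."""
    scores = {}
    pending = list(pairs)
    for _ in range(max_rounds):
        if not pending:
            break
        for pair in pending:
            scores[pair] = score_fn(*pair)
        # keep only the pairs that still came back as -1
        pending = [pair for pair in pending if scores[pair] == -1]
    return scores, pending


# stand-in scorer that never finds a number, to show the cap taking effect
scores, unresolved = rescore_invalid(
    [("Downburst", "Fog")], lambda a, b: -1, max_rounds=2
)
print(unresolved)
```

Pairs left in `unresolved` could then be inspected by hand or rerun with a larger `max_new_tokens`.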