# Llama 2 on Colab: a demonstration
This notebook is intended to run on Colab only. Because of the hardware limits of the T4 GPU in the free tier, we do not recommend generating the full association matrix with it: the run is far too slow to finish in any reasonable time. Use the CS cluster and the run_assocMat.sh script to generate the matrix instead.

## Step 0: Install ctransformers

In [1]:
!pip3 install ctransformers

Collecting ctransformers
  Downloading ctransformers-0.2.27-py3-none-any.whl (9.9 MB)
Installing collected packages: ctransformers
Successfully installed ctransformers-0.2.27


### Link Google Drive to Colab and upload Hazards Excel Spreadsheet.

In [2]:
import pandas as pd
import re
import io
import os
from google.colab import drive, files

# upload hazard_definitions.xlsx
drive.mount('/content/gdrive/')
uploaded = files.upload()

# read the excel spreadsheet using pandas
df = pd.read_excel(io.BytesIO(uploaded['hazard_definitions.xlsx']))
df

Mounted at /content/gdrive/


Saving hazard_definitions.xlsx to hazard_definitions.xlsx


Unnamed: 0,Hazard_Code,Hazard_Category,Hazard_Subcategory,Hazard_Name,Hazard_Description,Upstream_Hazards,Excluded_Hazards,Synonyms,Keywords,Keywords_Operator,Questions,Confused_Hazards
0,MH0001,METEOROLOGICAL AND HYDROLOGICAL,Convective-Related,Downburst,Downburst A downburst is a violent and damagin...,,,"Microburst, Macroburst, Wind Sear",Wind,OR,Did a downburst occur?,"ET0001, MH0003, MH0050"
1,MH0002,METEOROLOGICAL AND HYDROLOGICAL,Convective-Related,Lightning (Electrical Storm),Lightning Lightning is the luminous manifestat...,,,"Bolt, Thunderbolt, Bolt-from-the-blue, Firebol...","Lightning, storm",OR,Was there lightning?,"MH0003, GH0018"
2,MH0003,METEOROLOGICAL AND HYDROLOGICAL,Convective-Related,Thunderstorm,Thunderstorm A thunderstorm is defined as one ...,MH0002,,,storm,OR,Was there a thunderstorm?,"MH0002, MH0001, MH0006"
3,MH0004,METEOROLOGICAL AND HYDROLOGICAL,Flood,Coastal Flood,Coastal Flood Coastal flooding is most frequen...,,,"Storm Surge, Coastal inundation",Flood,OR,Did the even take place at the coast?,"MH0005, MH0027, MH0028"
4,MH0005,METEOROLOGICAL AND HYDROLOGICAL,Flood,Estuarine (Coastal) Flood,Estuarine Flood Estuarine flooding is flooding...,MH0004,,"Flood, Flooding, Coastal flooding",river,OR,Was the river flooding caused by coastal flood...,"MH0004, MH0028, MH0007"
...,...,...,...,...,...,...,...,...,...,...,...,...
297,SO0004,SOCIETAL,Post-Conflict,Explosive Remnants of War,Explosive Remnants of War Explosive remnants o...,SO0003,,"Unexploded ordnance, Abandoned explosive ordnance","explosive, crater, mines",AND,Were there any reports of explosive remnants o...,TL0053
298,SO0005,SOCIETAL,Post-Conflict,Environmental Degradation from Conflict,Environmental Degradation from Conflict Enviro...,SO0003,,"Ecological degradation, Environmental damage","pollution, war",AND,Were there any reports of environmental degrad...,"EN0004, EN0005, BI0007"
299,SO0006,SOCIETAL,Behavioural,Violence,Violence Violence refers to the intentional or...,SO0003,,,violence,OR,Were there any reports of violence?,"SO0003, SO0001"
300,SO0007,SOCIETAL,Behavioural,Stampede or Crushing (Human),Stampede or Crushing Stampede or crushing is t...,SO0003,,"Crush, Mass panic, Crowd disaster","stampede, crushing",OR,Were there any reports of stampede or crushing?,


## Step 0: Check that the NVIDIA GPU is available.

In [3]:
!nvidia-smi

Sat Mar  9 02:34:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Step 1: Shell script to download the model as well as set up the directories and requirements.

In [4]:
sh = """#!/bin/bash

# Change directory to the folder that contains this script
cd "$(dirname "$0")"
models_dir="./models"

# Check if the models directory exists
if [ -d "$models_dir" ]; then
    echo "Models directory already exists. Skipping download from Hugging Face."
else
    # Create models directory
    mkdir -p "$models_dir"

    # Download file from Hugging Face (replace with the actual URL)
    hugging_face_url="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf"
    wget "$hugging_face_url" -O "$models_dir/13B-chat-GGUF-q4_K_M.gguf"
    echo "File downloaded from Hugging Face successfully."
fi
"""
with open('script.sh', 'w') as file:
  file.write(sh)

# run bash script
!bash script.sh

--2024-03-09 02:35:00--  https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf
Resolving huggingface.co (huggingface.co)... 18.239.50.103, 18.239.50.16, 18.239.50.80, ...
Connecting to huggingface.co (huggingface.co)|18.239.50.103|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/8d/b1/8db1d1f73b4caa58e947ccbfe2fb27ac5e495c2ad8457ad299d15987aee3b520/7ddfe27f61bf994542c22aca213c46ecbd8a624cca74abff02a7b5a8c18f787f?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-13b-chat.Q4_K_M.gguf%3B+filename%3D%22llama-2-13b-chat.Q4_K_M.gguf%22%3B&Expires=1710207112&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDIwNzExMn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy84ZC9iMS84ZGIxZDFmNzNiNGNhYTU4ZTk0N2NjYmZlMmZiMjdhYzVlNDk1YzJhZDg0NTdhZDI5OWQxNTk4N2FlZTNiNTIwLzdkZGZlMjdmNjFiZjk5NDU0MmMyMmFjYTIxM2M0NmVjYm

## Step 2: Assuming that folder "models" exists, set up Llama2 using ctransformers
This build of Llama-2-13B-chat is quantised down to 4 bits (Q4_K_M), since anything larger leads to unexpected behaviour in Colab.
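
One plausible reason for preferring 4 bits is memory. The sketch below is a back-of-envelope estimate, not a measurement: it assumes roughly 13 billion parameters and counts only the nominal bits per weight, ignoring the per-block scales that GGUF quantisation adds and the memory needed for the KV cache. The function name `weight_memory_mib` is made up for this illustration; the 15360 MiB figure is the T4 total reported by nvidia-smi above.

```python
# Rough weight-memory estimate for a 13B-parameter model at several bit widths.
# Assumption: nominal bits per weight only; real files and inference need more.
N_PARAMS = 13e9          # approximate parameter count of Llama-2-13B
T4_MEMORY_MIB = 15360    # total GPU memory reported by nvidia-smi


def weight_memory_mib(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in MiB."""
    return n_params * bits_per_weight / 8 / 2**20


for bits in (4, 8, 16):
    size = weight_memory_mib(N_PARAMS, bits)
    verdict = "below" if size < T4_MEMORY_MIB else "above"
    print(f"{bits:>2}-bit weights: ~{size:,.0f} MiB ({verdict} the T4 total)")
```

Even where the weights alone come in below the total, the remaining headroom has to cover activations and the cache, which is why a smaller quantisation is the safer choice on this GPU.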

In [5]:
# load ctransformers and set up parameters
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    model_path_or_repo_id="./models/13B-chat-GGUF-q4_K_M.gguf",
    model_file="13B-chat-GGUF-q4_K_M.gguf",
    model_type="llama",
    max_new_tokens=50,
    repetition_penalty=1.2,
    temperature=0.25,
    top_p=0.95,
    top_k=150,
    threads=2,
    batch_size=10,
    gpu_layers=25,
)

## Step 3: Define some helper functions to output correct scores.

In [6]:
def extract_score(response):
    """
    Extracts the likelihood score from the model's response.

    Parameters:
    response (str): The text generated by the model.

    Returns:
    float: The extracted likelihood score if found, -1 otherwise.
    """
    # find the first standalone digit between 0 and 5; this also matches a
    # digit written in parentheses such as "(4)"
    likelihood_match = re.search(r'\b([0-5])\b', response)

    # check if a match is found and print the extracted number
    if likelihood_match:
        extracted_likelihood = float(likelihood_match.group(1))
        print("Extracted Likelihood:", extracted_likelihood)
        return extracted_likelihood
    else:
        print("No likelihood score found.")
        return -1
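
The extraction rule is easy to check in isolation. The following is a minimal sketch using a standalone copy of the idea (first standalone digit from 0 to 5); the name `parse_score` and the sample responses are invented for illustration and are not model output.

```python
import re

# Hypothetical standalone pattern: one digit 0-5 delimited by word boundaries.
SCORE_PATTERN = re.compile(r"\b([0-5])\b")


def parse_score(text: str) -> float:
    """Return the first standalone digit 0-5 in text as a float, or -1 if none."""
    match = SCORE_PATTERN.search(text)
    return float(match.group(1)) if match else -1


samples = [
    "Likelihood: 4 - Very likely.",       # plain digit
    "I would rate this (2), unlikely.",   # digit in parentheses
    "There is no numeric rating here.",   # no digit at all
    "Score: 15",                          # digits inside a larger number
]
for sample in samples:
    print(repr(sample), "->", parse_score(sample))
```

One limitation worth keeping in mind: the first matching digit wins, so a response that restates the scale ("on a scale of 0 to 5, ...") before giving its answer would be read as 0.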

## Step 4: Create ~ 303^2 combinations and feed through to LLM model.

In [7]:
# take combinations
from itertools import product

# extracting unique values from the desired column
unique_values = df['Hazard_Name'].unique() # change to create AxA matrix, remove to make all
print(unique_values)

# create ordered pairs, so both (x, y) and (y, x) appear, excluding pairs of a hazard with itself
pairs = [(x, y) for x, y in product(unique_values, unique_values) if x != y]

['Downburst' 'Lightning (Electrical Storm)' 'Thunderstorm' 'Coastal Flood'
 'Estuarine (Coastal) Flood' 'Flash Flood' 'Fluvial (Riverine) Flood'
 'Groundwater Flood' 'Ice-Jam Flood Including Debris'
 'Ponding (Drainage) Flood' 'Snowmelt Flood' 'Surface Water Flooding'
 'Glacial Lake Outburst Flood' 'Black Carbon (Brown Clouds)'
 'Dust storm or Sandstorm' 'Fog' 'Haze' 'Polluted Air' 'Sand haze' 'Smoke'
 'Ocean Acidification' 'Rogue Wave' 'Sea Water Intrusion'
 'Sea Ice (Ice Bergs)' 'Ice Flow' 'Seiche' 'Storm Surge' 'Storm Tides'
 'Tsunami' 'Depression or Cyclone (Low Pressure Area)'
 'Extra-tropical Cyclone' 'Sub-Tropical Cyclone' 'Acid Rain' 'Blizzard'
 'Drought' 'Hail' 'Ice Storm' 'Snow' 'Snow Storm' 'Cold Wave' 'Dzud'
 'Freeze' 'Frost (Hoar Frost)' 'Freezing Rain (Supercooled Rain)' 'Glaze'
 'Ground Frost' 'Heatwave' 'Icing (Including Ice)' 'Thaw' 'Avalanche'
 'Mud Flow' 'Rock slide' 'Derecho' 'Gale (Strong Gale)' 'Squall'
 'Subtropical Storm' 'Tropical Cyclone (Cyclonic Wind/ Rain S
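
To get a feel for the size of the job, the sketch below counts the ordered pairs and turns that into a rough running time. It assumes 303 hazards (the figure in the Step 4 heading) and a purely hypothetical 30 seconds per model call; substitute a timing measured on your own hardware. The helper names are made up for this example.

```python
from itertools import permutations, product


def ordered_pairs(items):
    """All (x, y) with x != y, in the same order the notebook builds them."""
    return [(x, y) for x, y in product(items, repeat=2) if x != y]


def n_ordered_pairs(n: int) -> int:
    """Number of ordered pairs of n distinct items without self-pairs."""
    return n * (n - 1)


# For unique items this is the same list as itertools.permutations(items, 2).
toy = ["Downburst", "Thunderstorm", "Flash Flood"]
assert ordered_pairs(toy) == list(permutations(toy, 2))

N_HAZARDS = 303          # taken from the Step 4 heading
SECONDS_PER_PAIR = 30    # hypothetical; measure this on your own hardware

total_pairs = n_ordered_pairs(N_HAZARDS)
days = total_pairs * SECONDS_PER_PAIR / 86400
print(f"{total_pairs} pairs, about {days:.0f} days at {SECONDS_PER_PAIR} s per pair")
```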

## Step 5: Create new dataframes to store new data.

In [8]:
# Paths of the score and justification spreadsheets on Google Drive
assoc_dir = "/content/gdrive/MyDrive/associations"
scores_path = os.path.join(assoc_dir, "scores.xlsx")
justifications_path = os.path.join(assoc_dir, "justifications.xlsx")
os.makedirs(assoc_dir, exist_ok=True)

# Check if the excel files already exist
if os.path.exists(scores_path) and os.path.exists(justifications_path):
    # Load the existing excel files to resume a previous run
    scores_df = pd.read_excel(scores_path, index_col=0, dtype=str)
    justification_df = pd.read_excel(justifications_path, index_col=0, dtype=str)
else:
    # Start from empty hazard-by-hazard matrices
    scores_df = pd.DataFrame(index=unique_values, columns=unique_values)
    justification_df = pd.DataFrame(index=unique_values, columns=unique_values)

scores_df

Unnamed: 0,Downburst,Lightning (Electrical Storm),Thunderstorm,Coastal Flood,Estuarine (Coastal) Flood,Flash Flood,Fluvial (Riverine) Flood,Groundwater Flood,Ice-Jam Flood Including Debris,Ponding (Drainage) Flood,...,Road Traffic Accident,Explosive agents,International Armed Conflict (IAC),Non-International Armed Conflict (NIAC),Civil Unrest,Explosive Remnants of War,Environmental Degradation from Conflict,Violence,Stampede or Crushing (Human),Financial shock
Downburst,,,,,,,,,,,...,,,,,,,,,,
Lightning (Electrical Storm),,,,,,,,,,,...,,,,,,,,,,
Thunderstorm,,,,,,,,,,,...,,,,,,,,,,
Coastal Flood,,,,,,,,,,,...,,,,,,,,,,
Estuarine (Coastal) Flood,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Explosive Remnants of War,,,,,,,,,,,...,,,,,,,,,,
Environmental Degradation from Conflict,,,,,,,,,,,...,,,,,,,,,,
Violence,,,,,,,,,,,...,,,,,,,,,,
Stampede or Crushing (Human),,,,,,,,,,,...,,,,,,,,,,


### Step 5.5: Find the first pair which does not have a value stored

In [9]:
def find_first_missing_pair(df):
    '''
    Finds the first pair of row and column categories in a DataFrame where the corresponding element is missing.

    Parameters:
    - df (pd.DataFrame): The DataFrame to search for missing values.

    Returns:
    - tuple or None: If a missing value is found, returns a tuple containing the row and column categories of the missing element. If no missing values are found, returns None.
    '''
    categories = df.columns  # rows and columns carry the same hazard names
    for i, row_category in enumerate(categories):
        for j, col_category in enumerate(categories):
            if i != j:  # Exclude pairs where both elements are the same
                value = df.iloc[i, j]  # positional lookup; row labels live in the index, not in a column
                if pd.isna(value):  # Check if the value is NaN
                    print(i, j)
                    return (row_category, col_category)
    return None

missing_value = find_first_missing_pair(scores_df)
missing_value

0 1


('Downburst', 'Lightning (Electrical Storm)')
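
The same resume logic can be tried on a toy matrix. This is an alternative sketch rather than the notebook's own function: it assumes NumPy is available, uses a boolean mask instead of nested loops, and the name `first_missing_off_diagonal` is hypothetical.

```python
import numpy as np
import pandas as pd


def first_missing_off_diagonal(matrix: pd.DataFrame):
    """Return (row_label, col_label) of the first NaN off the diagonal, or None."""
    missing = matrix.isna().to_numpy(copy=True)
    np.fill_diagonal(missing, False)     # self-pairs are never scored
    positions = np.argwhere(missing)     # row-major order, like the nested loops
    if len(positions) == 0:
        return None
    i, j = positions[0]
    return matrix.index[i], matrix.columns[j]


names = ["A", "B", "C"]
toy = pd.DataFrame(index=names, columns=names, dtype=object)
toy.loc["A", "B"] = 4.0                  # pretend the first pair is already scored
resume_from = first_missing_off_diagonal(toy)
print(resume_from)
```

Once every off-diagonal cell holds a value the function returns None, which the caller should treat as "nothing left to run".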

## Step 6: Run the LLM.

In [10]:
def run_llm(hazard1, hazard2, def1, def2):
    '''
    Run a likelihood assessment using a Language Model (LLM) to evaluate the likelihood that a given hazard1 causes hazard2.

    Parameters:
    - hazard1 (str): The first hazard in the assessment.
    - hazard2 (str): The second hazard in the assessment.
    - def1 (str): The definition of the first hazard.
    - def2 (str): The definition of the second hazard.

    Returns:
    tuple: A tuple containing the numerical score representing the likelihood assessment (ranging from 0 to 5) and the detailed response from the language model.
    '''
    # cut definitions to be only first sentence
    def1 = def1.split(".")[0]
    def2 = def2.split(".")[0]

    # begin prompting
    prompt = f"""What is the likelihood that {hazard1} causes {hazard2}, bearing in mind:
    {hazard1}: {def1}
    {hazard2}: {def2}
    """

    super_prompt = f"""
    SYSTEM: We're evaluating the likelihood of various hazards causing specific outcomes. Your responses should be one number between 0 and 5, following the below scale. Include a short explanation for your score, as it helps understand the reasoning behind your assessment.

    - 0: Almost never
    - 1: Very Unlikely
    - 2: Unlikely
    - 3: Likely
    - 4: Very likely
    - 5: Almost always

    Given the above, consider the following query:

    USER: {prompt}

    ASSISTANT:
    """
    response = llm(super_prompt)
    print(f"Running: {hazard1}, {hazard2}")
    print(response)
    score = extract_score(response)
    print("====================================")
    return (score, response)
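
The prompt assembly can be exercised without loading the model. Below is a minimal sketch that assumes three hypothetical helpers (`first_sentence`, `build_prompt`, and a stand-in `fake_llm` that returns a canned answer); the hazard definitions are placeholder text written for this example, not rows from the spreadsheet.

```python
def first_sentence(text: str) -> str:
    """Return the text before the first full stop (a crude heuristic)."""
    return text.split(".")[0].strip()


def build_prompt(hazard1: str, hazard2: str, def1: str, def2: str) -> str:
    """Assemble the user query in the same shape as run_llm does."""
    return (
        f"What is the likelihood that {hazard1} causes {hazard2}, bearing in mind:\n"
        f"{hazard1}: {first_sentence(def1)}\n"
        f"{hazard2}: {first_sentence(def2)}\n"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model: always returns the same canned answer."""
    return "Likelihood: 3 - Likely."


prompt = build_prompt(
    "Downburst",
    "Flash Flood",
    "A downburst is a strong downdraft. It spreads out at the surface.",
    "A flash flood is a rapid flood. It follows heavy rain.",
)
print(prompt)
print(fake_llm(prompt))
```

Splitting on the first full stop is fragile: a definition containing an abbreviation or a decimal number ("approx. 5 mm") is cut short at that point, so it is worth spot-checking a few truncated definitions before a long run.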


### Step 6.5: Inspect the pairs that will be evaluated

In [11]:
pairs

[('Downburst', 'Lightning (Electrical Storm)'),
 ('Downburst', 'Thunderstorm'),
 ('Downburst', 'Coastal Flood'),
 ('Downburst', 'Estuarine (Coastal) Flood'),
 ('Downburst', 'Flash Flood'),
 ('Downburst', 'Fluvial (Riverine) Flood'),
 ('Downburst', 'Groundwater Flood'),
 ('Downburst', 'Ice-Jam Flood Including Debris'),
 ('Downburst', 'Ponding (Drainage) Flood'),
 ('Downburst', 'Snowmelt Flood'),
 ('Downburst', 'Surface Water Flooding'),
 ('Downburst', 'Glacial Lake Outburst Flood'),
 ('Downburst', 'Black Carbon (Brown Clouds)'),
 ('Downburst', 'Dust storm or Sandstorm'),
 ('Downburst', 'Fog'),
 ('Downburst', 'Haze'),
 ('Downburst', 'Polluted Air'),
 ('Downburst', 'Sand haze'),
 ('Downburst', 'Smoke'),
 ('Downburst', 'Ocean Acidification'),
 ('Downburst', 'Rogue Wave'),
 ('Downburst', 'Sea Water Intrusion'),
 ('Downburst', 'Sea Ice (Ice Bergs)'),
 ('Downburst', 'Ice Flow'),
 ('Downburst', 'Seiche'),
 ('Downburst', 'Storm Surge'),
 ('Downburst', 'Storm Tides'),
 ('Downburst', 'Tsunami'),


## Step 7: Record and update spreadsheets.

In [None]:
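# store the scores and justifications in the dataframes
scores_path = "/content/gdrive/MyDrive/associations/scores.xlsx"
justifications_path = "/content/gdrive/MyDrive/associations/justifications.xlsx"

# look up each hazard's description once instead of filtering df for every pair
descriptions = df.drop_duplicates('Hazard_Name').set_index('Hazard_Name')['Hazard_Description']

# missing_value is the first pair without a stored value, so resume from there;
# if it is None, every pair already has a value and there is nothing to run
pairs = pairs[pairs.index(missing_value):] if missing_value is not None else []

for i, (hazard1, hazard2) in enumerate(pairs):
    score, response = run_llm(hazard1, hazard2, descriptions[hazard1], descriptions[hazard2])
    scores_df.loc[hazard1, hazard2] = score
    justification_df.loc[hazard1, hazard2] = response
    # checkpoint to Drive every 20 pairs
    if i % 20 == 0:
        scores_df.to_excel(scores_path)
        justification_df.to_excel(justifications_path)

# final save so that the last few results are not lost
scores_df.to_excel(scores_path)
justification_df.to_excel(justifications_path)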

Running: Downburst, Lightning (Electrical Storm)

    Based on my knowledge of these phenomena and their relationships, I would rate the likelihood of Downburst causing Lightning (Electrical Storm) as follows:

         Likelihood: 4 - Very likely.

Extracted Likelihood: 4.0
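
The checkpointing pattern of Step 7 (save every N pairs, plus one save at the end) can be sketched on its own. This illustration assumes a hypothetical helper `run_with_checkpoints` and a dummy scoring function, and it writes CSV to a temporary directory instead of Excel on Drive so that it runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def run_with_checkpoints(pairs, score_fn, matrix, path, every=20):
    """Score each pair, saving matrix to path every `every` pairs and once at the end."""
    for count, (a, b) in enumerate(pairs, start=1):
        matrix.loc[a, b] = score_fn(a, b)
        if count % every == 0:
            matrix.to_csv(path)          # periodic checkpoint
    matrix.to_csv(path)                  # final flush for the trailing pairs
    return matrix


names = ["A", "B", "C"]
matrix = pd.DataFrame(index=names, columns=names, dtype=object)
toy_pairs = [("A", "B"), ("A", "C"), ("B", "A")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scores.csv"
    run_with_checkpoints(toy_pairs, lambda a, b: 1.0, matrix, path, every=2)
    saved = pd.read_csv(path, index_col=0)

print(saved)
```

Without the final flush, the third pair in this toy run would exist only in memory, because the periodic checkpoint fires after the second pair and not again.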


## Step 8: Rerun all instances of -1, until there are no more -1's

In [None]:
def find_invalid_pairs():
    '''
    Finds pairs of row and column categories in the saved scores spreadsheet where the score is -1,
    i.e. where no likelihood could be extracted from the model's response.

    Returns:
    - list: A list of tuples containing the row and column categories of the invalid elements.
    '''
    scores_df = pd.read_excel("/content/gdrive/MyDrive/associations/scores.xlsx", index_col=0, dtype=str)

    # check which pairs are equal to -1 (read back as "-1" or "-1.0" depending on how the cell was written)
    invalid_pairs = []
    for i, row in scores_df.iterrows():
        for j, value in row.items():
            if value in ("-1", "-1.0"):
                invalid_pairs.append((i, j))

    return invalid_pairs

In [None]:
# re-run the invalid pairs and update scores_df
invalid_pairs = find_invalid_pairs()

# until no more invalid pairs (note: this never terminates if some pair keeps returning -1)
while invalid_pairs:
    for j, pair in enumerate(invalid_pairs):
        data = run_llm(pair[0], pair[1], df[df['Hazard_Name'] == pair[0]]['Hazard_Description'].values[0], df[df['Hazard_Name'] == pair[1]]['Hazard_Description'].values[0])
        scores_df.loc[pair[0], pair[1]] = data[0]
        justification_df.loc[pair[0], pair[1]] = data[1]
        if j % 20 == 0:
            scores_df.to_excel("/content/gdrive/MyDrive/associations/scores.xlsx")
            justification_df.to_excel("/content/gdrive/MyDrive/associations/justifications.xlsx")
    # save everything before checking again, because find_invalid_pairs reads the file on disk
    scores_df.to_excel("/content/gdrive/MyDrive/associations/scores.xlsx")
    justification_df.to_excel("/content/gdrive/MyDrive/associations/justifications.xlsx")
    invalid_pairs = find_invalid_pairs()
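
Because a pair can keep returning -1, it may be safer to cap the number of retry rounds. The sketch below shows one way to do that on a plain dictionary of scores; `retry_invalid`, `max_rounds`, and the deliberately flaky scorer are assumptions made for this example, not part of the notebook.

```python
def retry_invalid(scores: dict, score_fn, max_rounds: int = 3) -> dict:
    """Re-score entries equal to -1 for at most max_rounds rounds.

    Updates scores in place and returns the entries that are still -1.
    """
    for _ in range(max_rounds):
        invalid = [pair for pair, score in scores.items() if score == -1]
        if not invalid:
            break
        for pair in invalid:
            scores[pair] = score_fn(*pair)
    return {pair: score for pair, score in scores.items() if score == -1}


calls = {"n": 0}


def flaky_score(hazard1: str, hazard2: str) -> float:
    """Hypothetical scorer: fails (-1) on every odd call, succeeds on even ones."""
    calls["n"] += 1
    return -1 if calls["n"] % 2 else 3.0


scores = {("A", "B"): -1, ("A", "C"): 4.0, ("B", "A"): -1}
remaining = retry_invalid(scores, flaky_score, max_rounds=3)
print("still invalid:", remaining)
```

Pairs that are still invalid after the last round can then be logged and reviewed by hand instead of blocking the run indefinitely.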