<a href="https://colab.research.google.com/github/raym2828/Span-ASTE/blob/main/Notebooks/Llama2_prod_r5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Llama2 Production

SUMA 26/09/2023

# Instructions

Running the code on Google Colab:
1.   Create a Hugging Face account.
2.   Request access to the model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
3.   Create a folder named 'data' in the project folder on your personal Google Drive.
4.   Download the CrisisLexT26 data set from https://crisislex.org/data-collections.html#CrisisLexT26 and save it in the 'data' folder.
5.   Mount Google Drive.
6.   Update the paths in the code.
7.   Set up the system as per the steps below.
8.   Run the experiments.
9.   Evaluate.

## Setup

In [1]:
# Install the required packages (transformers, LangChain, quantization and acceleration helpers)
!pip install -q transformers einops accelerate langchain bitsandbytes


In [2]:
# Import Packages
from langchain import PromptTemplate,  LLMChain
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch
import pandas as pd
import numpy as np
import json
import os
import csv
import time
import re
import matplotlib.pyplot as plt
import textwrap



In [3]:
# Mounting GDrive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive
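
Before reading any files, it can help to confirm that the mounted Drive folder has the layout the later cells expect. The following is a minimal sketch, assuming one subfolder per crisis event under the data folder; `list_event_folders` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def list_event_folders(main_folder):
    """Return the sorted names of the event subfolders under main_folder."""
    root = Path(main_folder)
    if not root.is_dir():
        # Fail early with a clear message if the path was not updated
        raise FileNotFoundError(f"Data folder not found: {root}")
    return sorted(p.name for p in root.iterdir() if p.is_dir())

# Example usage in Colab (path as used in the data-preparation cells):
# list_event_folders('/content/gdrive/MyDrive/iLab2/data/CrisisLexT26')
```

An empty list here usually means the data set was saved one level too high or too low in the folder tree.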


### Prerequisites

To load our desired model, `meta-llama/Llama-2-7b-chat-hf`, we first need to authenticate ourselves on Hugging Face. This ensures we have the correct permissions to fetch the model.

1. Gain access to the model on Hugging Face: [Link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
2. Use the Hugging Face CLI to login and verify your authentication status.


In [4]:
# Logging onto HuggingFace
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
!pip install sentencepiece


Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


## Preparing data

In [6]:
# Read (timestamps)

# Path to the main folder
main_folder = '/content/gdrive/MyDrive/iLab2/data/CrisisLexT26'

# Function to process CSV files in a folder
def process_folder(folder_path):
    dfs = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.csv') and 'period' in file.lower():
                file_path = os.path.join(root, file)
                df = pd.read_csv(file_path)
                dfs.append(df)
    return dfs

# Read and process each subfolder
combined_data = []
for subfolder in os.listdir(main_folder):
    subfolder_path = os.path.join(main_folder, subfolder)
    if os.path.isdir(subfolder_path):
        subfolder_data = process_folder(subfolder_path)
        combined_data.extend(subfolder_data)

# Concatenate all data into one DataFrame
combined_df_p = pd.concat(combined_data, ignore_index=True)


# Remove spaces from column names
combined_df_p.columns = combined_df_p.columns.str.replace(' ', '')

# Rename Columns in prep for left join
combined_df_p.rename(columns={'Tweet-ID': 'Tweet ID'}, inplace=True)

# # Save the combined DataFrame to a CSV file
# output_file = 'combined_data.csv'
# combined_df_p.to_csv(output_file, index=False)

print("Combined timestamps read")


Combined timestamps read
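
The timestamp cell above and the tweet cell below both walk the event subfolders and differ only in the file-name keyword and the added label. As a sketch only, the two readers could be folded into one function; `read_event_csvs` and its `add_label` flag are hypothetical names introduced here for illustration.

```python
import os
import pandas as pd

def read_event_csvs(main_folder, keyword, add_label=False):
    """Read every CSV whose file name contains `keyword` from each event subfolder."""
    frames = []
    for subfolder in sorted(os.listdir(main_folder)):
        subfolder_path = os.path.join(main_folder, subfolder)
        if not os.path.isdir(subfolder_path):
            continue
        for root, _, files in os.walk(subfolder_path):
            for file in files:
                if file.endswith('.csv') and keyword in file.lower():
                    df = pd.read_csv(os.path.join(root, file))
                    if add_label:
                        df['Label'] = subfolder  # event folder name as label
                    frames.append(df)
    if not frames:
        raise ValueError(f"No '{keyword}' CSV files found under {main_folder}")
    return pd.concat(frames, ignore_index=True)

# Example usage (same keywords as the notebook cells):
# periods = read_event_csvs(main_folder, 'period')
# tweets = read_event_csvs(main_folder, 'labeled', add_label=True)
```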


In [7]:
# Read tweets

# Path to the main folder
main_folder = '/content/gdrive/MyDrive/iLab2/data/CrisisLexT26'

# Function to process CSV files in a folder, add folder name as label, and add subfolder name as a column
def process_folder(folder_path, label, subfolder_name):
    dfs = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.csv') and 'labeled' in file.lower():
                file_path = os.path.join(root, file)
                df = pd.read_csv(file_path)
                df['Label'] = label
                df['subfolder_name'] = subfolder_name  # Add subfolder name as a column
                dfs.append(df)
    return dfs

# Read and process each subfolder
combined_data = []
for subfolder in os.listdir(main_folder):
    subfolder_path = os.path.join(main_folder, subfolder)
    if os.path.isdir(subfolder_path):
        label = subfolder
        subfolder_data = process_folder(subfolder_path, label, subfolder)
        combined_data.extend(subfolder_data)

# Concatenate all data into one DataFrame
combined_df_l = pd.concat(combined_data, ignore_index=True)

# Print the first few rows of the combined DataFrame
print(f"Combined tweets read")
combined_df_l.head(10)


Combined tweets read


Unnamed: 0,Tweet ID,Tweet Text,Information Source,Information Type,Informativeness,Label,subfolder_name
0,242883454050648064,"#earthquake M 3.3, Virgin Islands region: Sept...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
1,242887379944366080,@EarthquakeTest update your #earthquake s more,Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
2,242919634125328384,"RT @RedazioneWebAL: #Terremoto, Costi (Pd): ta...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
3,242920737223106561,"#Earthquake M 2.6, Southern Alaska http://t.co...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
4,242936558158757889,５年６ヶ月長期保存可能なえいようかん5本。http://t.co/ZlSVctfi #eqj...,Not labeled,Not labeled,Not applicable,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
5,242945299084103681,#earthquake sige najud ug linug diri ui.. =.=,Not labeled,Not labeled,Not applicable,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
6,242950005076393985,@MRmusica pasenla super bien :) Mil besos desd...,Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
7,242951548576088065,RT @TerreInMoto: I Negozi di Mirandola 3 Mesi ...,Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
8,242979541373562880,#earthquake Philippines (the): NDRRMC Update r...,Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake
9,242997841134505985,"【#USGS #Breaking】 M 1.1, 28km SSW of Fairbanks...",Not labeled,Not labeled,Not applicable,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake


In [8]:
# Check
combined_df_p.columns
combined_df_p.head()


Unnamed: 0,Timestamp,Tweet ID,Included(Y/N)
0,Sun Sep 16 06:07:51 +0000 2012,247215201198436352,Y
1,Sun Sep 16 06:08:01 +0000 2012,247215243137257472,Y
2,Sun Sep 16 06:10:16 +0000 2012,247215809372516353,Y
3,Sun Sep 16 06:39:34 +0000 2012,247223182979911681,Y
4,Sun Sep 16 06:41:20 +0000 2012,247223627584520192,Y


In [9]:
# Extract earliest timestamp

# Convert it to datetime first
combined_df_p['Timestamp'] = pd.to_datetime(combined_df_p['Timestamp'])

# Group by 'Tweet ID' and aggregate to find the earliest timestamp
earliest_timestamps = combined_df_p.groupby('Tweet ID')['Timestamp'].agg('min').reset_index()

# Display or use the resulting DataFrame 'earliest_timestamps'
# print(earliest_timestamps)
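
Without a `format` argument, `pd.to_datetime` has to infer the layout of each string. The timestamps shown earlier follow a fixed layout (`Sun Sep 16 06:07:51 +0000 2012`), so an explicit format is one option. The sketch below uses toy rows; the duplicated Tweet ID is made up to show the effect of taking the minimum per group.

```python
import pandas as pd

# Toy rows in the same layout as the period files; the repeated ID is illustrative
toy = pd.DataFrame({
    'Tweet ID': [247215201198436352, 247215201198436352, 247215243137257472],
    'Timestamp': [
        'Sun Sep 16 06:10:16 +0000 2012',
        'Sun Sep 16 06:07:51 +0000 2012',
        'Sun Sep 16 06:08:01 +0000 2012',
    ],
})

# Parse with an explicit format and normalise to UTC
toy['Timestamp'] = pd.to_datetime(
    toy['Timestamp'], format='%a %b %d %H:%M:%S %z %Y', utc=True
)

# One row per Tweet ID, keeping the earliest timestamp
earliest = toy.groupby('Tweet ID')['Timestamp'].min().reset_index()
print(earliest)
```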

In [10]:
# Left join

# Perform a left join on 'Tweet ID'
result_df = combined_df_l.merge(earliest_timestamps, on='Tweet ID', how='left')

# Display
# print(result_df)
# result_df.shape
result_df.head()

Unnamed: 0,Tweet ID,Tweet Text,Information Source,Information Type,Informativeness,Label,subfolder_name,Timestamp
0,242883454050648064,"#earthquake M 3.3, Virgin Islands region: Sept...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake,2012-09-04 07:15:02+00:00
1,242887379944366080,@EarthquakeTest update your #earthquake s more,Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake,2012-09-04 07:30:38+00:00
2,242919634125328384,"RT @RedazioneWebAL: #Terremoto, Costi (Pd): ta...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake,2012-09-04 09:38:48+00:00
3,242920737223106561,"#Earthquake M 2.6, Southern Alaska http://t.co...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake,2012-09-04 09:43:11+00:00
4,242936558158757889,５年６ヶ月長期保存可能なえいようかん5本。http://t.co/ZlSVctfi #eqj...,Not labeled,Not labeled,Not applicable,2012_Costa_Rica_earthquake,2012_Costa_Rica_earthquake,2012-09-04 10:46:03+00:00
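
A left join keeps every tweet, so any tweet without a matching timestamp ends up with a missing `Timestamp` value. As a hedged sketch on toy data, the `indicator` and `validate` parameters of `DataFrame.merge` can make such gaps, or unexpected duplicate keys on the right side, visible.

```python
import pandas as pd

# Toy frames standing in for combined_df_l and earliest_timestamps
tweets = pd.DataFrame({'Tweet ID': [1, 2, 3], 'Tweet Text': ['a', 'b', 'c']})
stamps = pd.DataFrame({
    'Tweet ID': [1, 2],
    'Timestamp': pd.to_datetime(
        ['2012-09-04 07:15:02', '2012-09-04 07:30:38'], utc=True
    ),
})

# validate raises if a Tweet ID appears more than once on the right side;
# indicator adds a '_merge' column describing where each row came from
checked = tweets.merge(stamps, on='Tweet ID', how='left',
                       validate='many_to_one', indicator=True)
missing = (checked['_merge'] == 'left_only').sum()
print(f"Tweets without a timestamp: {missing}")
```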


In [11]:
# Read

# Define the main folder path
main_folder = '/content/gdrive/MyDrive/iLab2/data/CrisisLexT26'

# Initialize empty lists to store the extracted data
subfolder_names = []
names = []
start_days = []
durations = []
countries = []
location_descriptions = []
sub_categories = []
types = []

# Loop through the subfolders
for subfolder in os.listdir(main_folder):
    subfolder_path = os.path.join(main_folder, subfolder)

    # Check if it's a directory
    if os.path.isdir(subfolder_path):
        json_files = [f for f in os.listdir(subfolder_path) if f.endswith('.json')]

        # Loop through JSON files in the subfolder
        for json_file in json_files:
            json_path = os.path.join(subfolder_path, json_file)

            # Read the JSON file
            with open(json_path, 'r') as f:
                data = json.load(f)

                # Extract the required information
                subfolder_names.append(subfolder)
                names.append(data['name'])
                start_days.append(data['time']['start_day'])
                durations.append(data['time']['duration'])
                countries.append(data['location']['country'])
                location_descriptions.append(data['location']['location_description'])
                sub_categories.append(data['categorization']['sub_category'])
                types.append(data['categorization']['type'])

# Create a DataFrame from the extracted data
df_annotation = pd.DataFrame({
    'subfolder_name': subfolder_names,
    'name': names,
    'start_day': start_days,
    'duration': durations,
    'country': countries,
    'location_description': location_descriptions,
    'sub_category': sub_categories,
    'type': types
})

# Display the DataFrame
df_annotation.head()


Unnamed: 0,subfolder_name,name,start_day,duration,country,location_description,sub_category,type
0,2012_Costa_Rica_earthquake,Costa Rica earthquake,2012-09-04,13,Costa Rica,Costa Rica,Geophysical,Earthquake
1,2012_Venezuela_refinery,Venezuela refinery explosion,2012-08-24,12,Venezuela,Falcon,Unintentional,Explosion
2,2013_Alberta_floods,Alberta Floods,17/06/2013,25,Canada,Alberta,Hydrological,Floods
3,2013_Australia_bushfire,Australia wildfires,2013-10-12,21,Australia,New South Wales,Climatological,Wildfire
4,2012_Guatemala_earthquake,Guatemala earthquake,2012-11-06,20,Guatemala,Guatemala,Geophysical,Earthquake
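
The `start_day` values above are not in a single format: most rows use `YYYY-MM-DD`, while the Alberta floods row uses `17/06/2013`. If the column is later used for date arithmetic, it may need normalising first. A minimal sketch, assuming only these two layouts occur and that the slash form is day-first; `parse_start_day` is a hypothetical helper.

```python
from datetime import datetime

def parse_start_day(value):
    """Parse a start_day string given as YYYY-MM-DD or DD/MM/YYYY."""
    for fmt in ('%Y-%m-%d', '%d/%m/%Y'):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised start_day format: {value!r}")

print(parse_start_day('2012-09-04'))
print(parse_start_day('17/06/2013'))

# Example usage on the annotation table:
# df_annotation['start_day'] = df_annotation['start_day'].map(parse_start_day)
```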


In [12]:
# Left join

# Perform a left join on 'subfolder_name'
result_df = result_df.merge(df_annotation, on='subfolder_name', how='left')

# Drop 'subfolder_name', which holds the same values as the 'Label' column
result_df.drop(columns=['subfolder_name'], inplace=True)


In [13]:
# Count the number of rows in result_df
row_count = result_df.shape[0]

# Print the row count
print(f"Number of rows in result_df: {row_count}")

result_df.head()


Number of rows in result_df: 27933


Unnamed: 0,Tweet ID,Tweet Text,Information Source,Information Type,Informativeness,Label,Timestamp,name,start_day,duration,country,location_description,sub_category,type
0,242883454050648064,"#earthquake M 3.3, Virgin Islands region: Sept...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012-09-04 07:15:02+00:00,Costa Rica earthquake,2012-09-04,13,Costa Rica,Costa Rica,Geophysical,Earthquake
1,242887379944366080,@EarthquakeTest update your #earthquake s more,Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012-09-04 07:30:38+00:00,Costa Rica earthquake,2012-09-04,13,Costa Rica,Costa Rica,Geophysical,Earthquake
2,242919634125328384,"RT @RedazioneWebAL: #Terremoto, Costi (Pd): ta...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012-09-04 09:38:48+00:00,Costa Rica earthquake,2012-09-04,13,Costa Rica,Costa Rica,Geophysical,Earthquake
3,242920737223106561,"#Earthquake M 2.6, Southern Alaska http://t.co...",Not labeled,Not labeled,Not related,2012_Costa_Rica_earthquake,2012-09-04 09:43:11+00:00,Costa Rica earthquake,2012-09-04,13,Costa Rica,Costa Rica,Geophysical,Earthquake
4,242936558158757889,５年６ヶ月長期保存可能なえいようかん5本。http://t.co/ZlSVctfi #eqj...,Not labeled,Not labeled,Not applicable,2012_Costa_Rica_earthquake,2012-09-04 10:46:03+00:00,Costa Rica earthquake,2012-09-04,13,Costa Rica,Costa Rica,Geophysical,Earthquake


## Modeling

In [14]:
# Tokenizer and text-generation pipeline
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation", #task
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    do_sample=True,
    top_k=10,

    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id
)


In [15]:
# Wrap the pipeline as a LangChain LLM
llm = HuggingFacePipeline(pipeline = pipeline, model_kwargs = {'temperature':0})


In [16]:
# Prompt Template Formatting
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
    SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
    prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
    return prompt_template

def parse_text(text):
    # Wrap long model output to 100 characters per line and print it
    wrapped_text = textwrap.fill(text, width=100)
    print(wrapped_text + '\n\n')
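
To make the template concrete, the sketch below assembles a prompt the same way `get_prompt` does and prints the result. It assumes the Llama 2 chat markers `<<SYS>>` and `<</SYS>>` around the system prompt; `build_prompt` is a hypothetical stand-in written for this example, and the short prompts are placeholders.

```python
# Llama 2 chat markers (instruction and system-prompt delimiters)
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(instruction, system_prompt):
    """Wrap a system prompt and an instruction in Llama 2 chat markers."""
    return B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST

# {text} is left in place so LangChain's PromptTemplate can fill it later
example = build_prompt("Classify this text: {text}", "Return one word.")
print(example)
```

The `{text}` placeholder survives untouched, which is what lets `PromptTemplate` substitute the tweet text at run time.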

In [17]:
def generate_answer(text, system_prompt, instruction):
    prompt_template = get_prompt(instruction, system_prompt)
    prompt = PromptTemplate(template=prompt_template, input_variables=['text'])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    text = llm_chain.run(text)
    return text

In [18]:
data = {
    'TweetID': [25454],
    'TweetText': ['MT @COEmergency Colorado Avoid main road because its blocked by wildfire'],
    'InformationSource': ['Government'],
    'InformationType': ['Donations and volunteering'],
    'Informativeness Label': ['Related and informative'],
    'Timestamp': ['2013-09-17 10:43:12+00:00'],
    'answer': ['']
}

df = pd.DataFrame(data)
df


Unnamed: 0,TweetID,TweetText,InformationSource,InformationType,Informativeness Label,Timestamp,answer
0,25454,MT @COEmergency Colorado Avoid main road becau...,Government,Donations and volunteering,Related and informative,2013-09-17 10:43:12+00:00,


In [19]:
# Generate an answer for a single tweet: fill the prompt template with the tweet text and run it through the LLM chain.
def generate_answer(text, system_prompt, instruction):
    # # Define the instruction
    # instruction = instruction + row['TweetText']
    #define the prompt template
    prompt_template = get_prompt(instruction, system_prompt)
    #create prompt
    prompt = PromptTemplate(template=prompt_template, input_variables= ["text"])
    #create the language model chain
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    #generate text using the prompt template
    text = llm_chain.run(text)
    #return the text
    return text


In [None]:
# Apply the function to the dataframe to generate the answer and save the output to 'answer' in df.
system_prompt = "You are an advance assistant that excels at classifying whether the text that contains useful information during or after a crisis. A crisis can include bushfires, wildfires, floods, hurricanes, earthquakes, covid-19, pandemic etc. Stop after giving one answer.  Return one word. "
instruction = "Recognize whether this text is in relation to a crisis. If the text is in relation to a crisis, label in one word what is the name of the crisis. Some examples of crisis include [\"Coronavirus\", \"Flood\", \"Earthquake\"]. If no crisis exists, then only answer \"[]\". Return 1 word without explaination. The test is: {text}"

df['answer'] = df.apply(lambda row: generate_answer(row['TweetText'], system_prompt, instruction), axis=1)



In [None]:
# Check
df.head()


Unnamed: 0,TweetID,TweetText,InformationSource,InformationType,Informativeness Label,Timestamp,answer
0,25454,MT @COEmergency Colorado Avoid main road becau...,Government,Donations and volunteering,Related and informative,2013-09-17 10:43:12+00:00,Wildfire


In [20]:
# Grab 10 random samples
filtered_df = result_df.query("Label == '2012_Colorado_wildfires'")
testing_df = filtered_df.sample(n=10, random_state=42)
testing_df.columns = testing_df.columns.str.replace(' ', '')
testing_df['TweetText'] = testing_df['TweetText'].astype(str)


testing_df.head(12)


Unnamed: 0,TweetID,TweetText,InformationSource,InformationType,Informativeness,Label,Timestamp,name,start_day,duration,country,location_description,sub_category,type
8839,221503954108940290,PetMeds Joins Efforts to Help Pets Affected by...,Media,Donations and volunteering,Related and informative,2012_Colorado_wildfires,2012-07-07 07:20:32+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
8526,218318321509085185,Federal firefighters go uninsured: As brutal w...,Outsiders,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-28 12:21:58+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
7762,212412775027314690,So proud of the Colorado Springs Air Force gra...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-12 05:15:26+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
8100,217406194590035968,I couldn't remember what happened to R.J. on #...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-25 23:57:30+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
7719,212056699572469760,#telluride #colorado #landscape http://t.co/Y...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-11 05:40:31+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
8781,220307872813285376,Anti-Tax movement region of Colorado hit by fi...,Media,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-07-04 00:07:44+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
7984,215821620097462272,USFS forest ecologist: Not just forests out of...,Government,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-21 15:00:58+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
8635,218692809920749572,Colorado Fire Relief benefit concerts at Red R...,NGOs,Donations and volunteering,Related and informative,2012_Colorado_wildfires,2012-06-29 13:10:03+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
8072,217059048850325504,"Lady, in her favorite perch above the backyard...",Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-25 00:58:04+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire
8516,218229725229887488,@KleinErin seems there are heaps of plane issu...,Outsiders,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-28 06:29:55+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire


In [21]:
# Running over data

system_prompt = "You are an advance assistant that excels at classifying whether the text that contains useful information during or after a crisis. A crisis can include bushfires, wildfires, floods, hurricanes, earthquakes, covid-19, pandemic etc. Stop after giving one answer.  Return in max 3 words. "
instruction = "Recognize whether this text is in relation to a crisis. If the text is in relation to a crisis, label in one word what is the name of the crisis. Some examples of crisis include [\"Coronavirus\", \"Flood\", \"Earthquake\"]. If no crisis exists, then only answer \"[]\". Return in max 3 words without explaination. The test is: {text}"

# Create an empty 'answer' column in testing_df
testing_df['answer'] = ''

# Loop over the data
for index, row in testing_df.iterrows():
    TweetText = row['TweetText']

    # Generate the prompt
    # prompt_template = get_prompt(instruction, system_prompt)

    # Update the 'answer' column in the DataFrame using the output from generate_answer
    testing_df.loc[index, 'answer'] = generate_answer(TweetText, system_prompt, instruction)



In [22]:
# Check
testing_df.head(10)
#Print "answer" column in testing_df first row
# print(testing_df['answer'])



Unnamed: 0,TweetID,TweetText,InformationSource,InformationType,Informativeness,Label,Timestamp,name,start_day,duration,country,location_description,sub_category,type,answer
8839,221503954108940290,PetMeds Joins Efforts to Help Pets Affected by...,Media,Donations and volunteering,Related and informative,2012_Colorado_wildfires,2012-07-07 07:20:32+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: Wildfire
8526,218318321509085185,Federal firefighters go uninsured: As brutal w...,Outsiders,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-28 12:21:58+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: Wildfire
7762,212412775027314690,So proud of the Colorado Springs Air Force gra...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-12 05:15:26+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,[]
8100,217406194590035968,I couldn't remember what happened to R.J. on #...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-25 23:57:30+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: Wildfire
7719,212056699572469760,#telluride #colorado #landscape http://t.co/Y...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-11 05:40:31+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Covid-19
8781,220307872813285376,Anti-Tax movement region of Colorado hit by fi...,Media,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-07-04 00:07:44+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: Fire
7984,215821620097462272,USFS forest ecologist: Not just forests out of...,Government,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-21 15:00:58+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Fire
8635,218692809920749572,Colorado Fire Relief benefit concerts at Red R...,NGOs,Donations and volunteering,Related and informative,2012_Colorado_wildfires,2012-06-29 13:10:03+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: Wildfire
8072,217059048850325504,"Lady, in her favorite perch above the backyard...",Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-25 00:58:04+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: []
8516,218229725229887488,@KleinErin seems there are heaps of plane issu...,Outsiders,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-28 06:29:55+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Crisis: Fire
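
The answers above arrive in several shapes (`Crisis: Wildfire`, `Fire`, `[]`, `Crisis: []`), which makes them awkward to compare against the ground truth. One possible post-processing step, sketched here under the assumption that only a `Crisis:` prefix and the empty marker `[]` need handling, is to normalise each answer; `normalise_answer` is a hypothetical helper.

```python
import re

def normalise_answer(answer):
    """Strip a leading 'Crisis:' prefix and map '[]' or empty text to None."""
    cleaned = re.sub(r'^\s*crisis\s*:\s*', '', answer, flags=re.IGNORECASE).strip()
    if cleaned in ('', '[]'):
        return None
    return cleaned

# Answer strings taken from the output table above
for raw in ['Crisis: Wildfire', '[]', 'Crisis: []', 'Fire', 'Covid-19']:
    print(repr(raw), '->', normalise_answer(raw))

# Example usage:
# testing_df['answer_clean'] = testing_df['answer'].map(normalise_answer)
```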


## Experiments

In [28]:
# Running over data

system_prompt = "You are an advance assistant that excels at classifying whether the text that contains useful information during or after a crisis. A crisis can include bushfires, wildfires, floods, hurricanes, earthquakes, covid-19, pandemic etc. Stop after giving one answer. If it is not related to a crisis return '''Not applicable'''. Only return one label that is the most appropriate without explaination. "
instruction = "Classify the following text for me using these crisis information labels [ \"Caution and advice for residents\", \"Written by Affected individuals \", \"Infrastructure and utilities damage\", \"Soliciting Donations or volunteering to help \", \"expressing Sympathy and support for affected\", \"other useful information\", \"Not applicable\"]. Return one label without explaination. {text}"

# Create an empty 'answer' column in testing_df
testing_df['answer'] = ''

# Loop over the data
for index, row in testing_df.iterrows():
    TweetText = row['TweetText']

    # Generate the prompt
    # prompt_template = get_prompt(instruction, system_prompt)

    # Update the 'answer' column in the DataFrame using the output from generate_answer
    testing_df.loc[index, 'answer'] = generate_answer(TweetText, system_prompt, instruction)

testing_df.head(10)



In [29]:
testing_df.head(10)

Unnamed: 0,TweetID,TweetText,InformationSource,InformationType,Informativeness,Label,Timestamp,name,start_day,duration,country,location_description,sub_category,type,answer
8839,221503954108940290,PetMeds Joins Efforts to Help Pets Affected by...,Media,Donations and volunteering,Related and informative,2012_Colorado_wildfires,2012-07-07 07:20:32+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,"""Caution and advice for residents"""
8526,218318321509085185,Federal firefighters go uninsured: As brutal w...,Outsiders,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-28 12:21:58+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
7762,212412775027314690,So proud of the Colorado Springs Air Force gra...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-12 05:15:26+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
8100,217406194590035968,I couldn't remember what happened to R.J. on #...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-25 23:57:30+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
7719,212056699572469760,#telluride #colorado #landscape http://t.co/Y...,Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-11 05:40:31+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
8781,220307872813285376,Anti-Tax movement region of Colorado hit by fi...,Media,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-07-04 00:07:44+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
7984,215821620097462272,USFS forest ecologist: Not just forests out of...,Government,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-21 15:00:58+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
8635,218692809920749572,Colorado Fire Relief benefit concerts at Red R...,NGOs,Donations and volunteering,Related and informative,2012_Colorado_wildfires,2012-06-29 13:10:03+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,"Sure! Based on the given text, the most appr..."
8072,217059048850325504,"Lady, in her favorite perch above the backyard...",Not labeled,Not labeled,Not related,2012_Colorado_wildfires,2012-06-25 00:58:04+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
8516,218229725229887488,@KleinErin seems there are heaps of plane issu...,Outsiders,Other Useful Information,Related and informative,2012_Colorado_wildfires,2012-06-28 06:29:55+00:00,Colorado wildfires,2012-06-08,31,US,Colorado,Climatological,Wildfire,Not applicable
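
Some responses above are not a bare label: one is wrapped in quotation marks and another starts with "Sure! Based on the given text...". Before any evaluation, the label has to be pulled out of such text. A minimal sketch, assuming the label list from the instruction (with trailing spaces stripped) and a simple case-insensitive substring search; `extract_label` is a hypothetical helper and the verbose example sentence is invented.

```python
# Labels copied from the instruction prompt, trailing spaces removed
LABELS = [
    "Caution and advice for residents",
    "Written by Affected individuals",
    "Infrastructure and utilities damage",
    "Soliciting Donations or volunteering to help",
    "expressing Sympathy and support for affected",
    "other useful information",
    "Not applicable",
]

def extract_label(response, labels=LABELS):
    """Return the label that appears earliest in the response, or None."""
    lowered = response.lower()
    hits = [(lowered.find(label.lower()), label)
            for label in labels if label.lower() in lowered]
    return min(hits)[1] if hits else None

print(extract_label('"Caution and advice for residents"'))
print(extract_label('Sure! The most appropriate label is other useful information.'))
```

Responses that mention no known label return `None`, so they can be counted separately rather than silently scored as wrong.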


## Performance Evaluation