# Introduction to the Falcon LLM

Currently, our dataset contains Tweets that were filtered by keywords and emojis related to precipitation, so they should provide precipitation-related information. However, many of these Tweets clearly do not provide enough information even for a human to decide whether it was "raining" or "not raining" at the location of the Tweeter.

To build a more robust dataset for training the rain classifier, we would like to train an additional classifier that decides whether a Tweet contains "relevant" information that would allow the rain classifier to make an educated guess. To train this "relevance" classifier, we would like to
leverage an LLM that labels our data. Here, we build some prompts that ask the LLM to judge whether a Tweet contains information related to the presence of rain, and we test how the model reacts.

In addition, the notebook contains code that reduces the output files generated from the LLM's responses.

Note: this notebook requires the kernel `ap2_HF-LLM-BnB` instead of `ap2`!

In [1]:
!jupyter kernelspec list

Available kernels:
  ap2               /p/home/jusers/haque1/juwels/.local/share/jupyter/kernels/ap2
  ap2_hf-llm-bnb    /p/home/jusers/haque1/juwels/.local/share/jupyter/kernels/ap2_HF-LLM-BnB
  ap2falcon         /p/home/jusers/haque1/juwels/.local/share/jupyter/kernels/ap2falcon
  bootcamp2022      /p/home/jusers/haque1/juwels/.local/share/jupyter/kernels/bootcamp2022
  python3           /usr/local/share/jupyter/kernels/python3


In [2]:
# allows update of external libraries without need to reload package
%load_ext autoreload
%autoreload 2

In [3]:
import numpy as np
import transformers
import torch
import xarray
import re
import os
import glob
import sys


import a2.utils

import a2.training.training_hugging
import a2.training.evaluate_hugging
import a2.training.dataset_hugging
import a2.plotting.analysis
import a2.plotting.histograms
import a2.dataset

In [4]:
# a2.training.utils_training.gpu_available()
torch.cuda.empty_cache()

In [5]:
FOLDER_MODEL = "/p/project/deepacf/maelstrom/ehlert1/models/falcon-40b"
!ls $FOLDER_MODEL

README.md			  pytorch_model-00005-of-00009.bin
config.json			  pytorch_model-00006-of-00009.bin
configuration_falcon.py		  pytorch_model-00007-of-00009.bin
generation_config.json		  pytorch_model-00008-of-00009.bin
modeling_falcon.py		  pytorch_model-00009-of-00009.bin
pytorch_model-00001-of-00009.bin  pytorch_model.bin.index.json
pytorch_model-00002-of-00009.bin  special_tokens_map.json
pytorch_model-00003-of-00009.bin  tokenizer.json
pytorch_model-00004-of-00009.bin  tokenizer_config.json


In [6]:
FOLDER_DATA = "/p/project/deepacf/maelstrom/haque1/dataset/"
FILE_TWEETS = FOLDER_DATA + "tweets_2017_01_era5_normed_filtered.nc"

In [7]:
!ls -Rtlh $FOLDER_DATA

/p/project/deepacf/maelstrom/haque1/dataset/:
total 9.2G
-rw-r--r-- 1 haque1 deepacf 357M Nov  3 15:06 tweets_2017_era5_normed_filtered.nc
-rw-r--r-- 1 haque1 deepacf  35M Nov  3 14:07 tweets_2017_01_era5_normed_filtered.nc
-rw-r--r-- 1 haque1 deepacf 233M Nov  3 11:41 20_percent_2017_2020_tweets_rain_sun_vocab_emojis_locations_bba_Tp_era5_no_bots_normalized_filtered_weather_stations_fix_predicted_simpledeberta_radar.nc
-rw-r--r-- 1 haque1 deepacf 8.0K Nov  2 09:15 ds_precipitation_10_percent.nc
-rwxr-xr-x 1 haque1 deepacf 2.3G Oct 27 16:48 2017_2020_tweets_rain_sun_vocab_emojis_locations_bba_Tp_era5_no_bots_normalized_filtered_weather_stations_fix_predicted_simpledeberta_radar.nc
-rwxr-xr-x 1 haque1 deepacf 6.4G Oct 27 16:47 ds_precipitation_2017.nc


In [15]:
!ls -Rtlh "/p/project/deepacf/maelstrom/haque1/model/"

/p/project/deepacf/maelstrom/haque1/model/:
total 7.0K
drwxr-sr-x 2 haque1 deepacf 4.0K Nov  3 15:14 checkpoint-2983
drwxr-sr-x 2 haque1 deepacf 4.0K Nov  3 14:38 checkpoint-287
drwxr-sr-x 2 haque1 deepacf 4.0K Nov  3 14:11 checkpoint-919
drwxr-sr-x 2 haque1 deepacf 4.0K Nov  3 11:56 checkpoint-4570
drwxr-sr-x 2 haque1 deepacf 4.0K Nov  3 11:25 checkpoint
drwxr-sr-x 2 haque1 deepacf 4.0K Nov  1 16:43 checkpoint-45698
drwxr-sr-x 3 haque1 deepacf 4.0K Nov  1 12:55 deberta-v3-small

/p/project/deepacf/maelstrom/haque1/model/checkpoint-2983:
total 1.6G
-rw-r--r-- 1 haque1 deepacf  14K Nov  3 15:14 rng_state.pth
-rw-r--r-- 1 haque1 deepacf 1.4K Nov  3 15:14 trainer_state.json
-rw-r--r-- 1 haque1 deepacf 1.1K Nov  3 15:14 scheduler.pt
-rw-r--r-- 1 haque1 deepacf 1.1G Nov  3 15:14 optimizer.pt
-rw-r--r-- 1 haque1 deepacf 4.5K Nov  3 15:14 training_args.bin
-rw-r--r-- 1 haque1 deepacf 8.3M Nov  3 15:14 tokenizer.json
-rw-r--r-- 1 haque1 deepacf 2.4M Nov  3 15:14 spm.model
-rw-r--r-- 1 haque1 d

In [8]:
ds = a2.dataset.load_dataset.load_tweets_dataset(FILE_TWEETS)
ds

In [9]:
ds["relevance_hand"] = (["index"], np.ones_like(ds.index.values))
ds[["text", "raining", "raining_station", "relevance_hand"]].to_pandas().to_csv(
    "tweets_2017_01_era5_normed_filtered.csv"
)

In [10]:
ds["raining"] = (["index"], np.array(ds.tp_mm_station.values > 6e-3, dtype=int))

In [11]:
!cat $FOLDER_MODEL/tokenizer_config.json

{
  "add_prefix_space": false,
  "eos_token": "<|endoftext|>",
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 2048,
  "special_tokens_map_file": null,
  "tokenizer_class": "PreTrainedTokenizerFast"
}

In [12]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    FOLDER_MODEL, device_map="auto", trust_remote_code=False, quantization_config=bnb_config
)

tokenizer = transformers.AutoTokenizer.from_pretrained(FOLDER_MODEL)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
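
As a rough back-of-the-envelope check of why the model is loaded in 4-bit precision, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of 40 billion is an assumption taken from the model name `falcon-40b`; quantization constants, activations and the key/value cache are ignored.

```python
# Rough estimate of the memory needed for the model weights alone.
# N_PARAMETERS is an assumption derived from the model name "falcon-40b".
N_PARAMETERS = 40e9

BITS_PER_WEIGHT = {"float32": 32, "bfloat16": 16, "int8": 8, "nf4": 4}


def weight_memory_gb(n_parameters, bits_per_weight):
    """Return the weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_parameters * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: {weight_memory_gb(N_PARAMETERS, bits):6.0f} GB")
```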

In [13]:
transformers.__version__

'4.36.1'

## Analyze misclassified Tweets

Let's make predictions with our previously trained model and build a dataset of the Tweets that it misclassified. We expect a large portion of them not to contain sufficient information for the model to accurately estimate the presence of rain.
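
For binary labels, misclassified entries are those where ground truth and prediction differ. A minimal sketch with made-up toy arrays (not taken from the dataset) illustrates that summing both label arrays and testing for 1 is equivalent to a direct comparison:

```python
import numpy as np

# Toy labels (hypothetical values): 1 = "raining", 0 = "not raining".
truth = np.array([1, 0, 1, 0, 1])
predictions = np.array([1, 1, 0, 0, 1])

# For binary labels, the sum equals 1 exactly when truth and prediction differ.
miss_by_sum = (truth + predictions) == 1
miss_by_comparison = truth != predictions

# Positions of the misclassified entries.
misclassified_positions = np.flatnonzero(miss_by_comparison)
print(misclassified_positions)
```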

In [16]:
folder_model = "/p/project/deepacf/maelstrom/haque1/model/checkpoint-45698"  # change to your models

truth, predictions, prediction_probabilities = a2.training.evaluate_hugging.make_predictions_loaded_model(
    ds, indices_validate=ds.index.values, folder_model=folder_model, key_inputs="text_normalized"
)

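# For binary labels, truth + predictions == 1 exactly when the prediction is wrong.
miss = truth + predictions
# Note: selecting via np.arange assumes that the `index` labels run from 0 to len(ds) - 1.
ds_miss = ds.sel(index=np.arange(len(ds.index.values))[miss == 1])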
ds_miss

Casting the dataset:   0%|          | 0/24491 [00:00<?, ? examples/s]

Map:   0%|          | 0/24491 [00:00<?, ? examples/s]

Cannot initialize mantik!


### Let's throw out Tweets mentioning snow as it seems to confuse the model as well

In [17]:
ds_no_snow = ds_miss.where(~ds_miss.text_normalized.str.contains("snow", flags=re.IGNORECASE), drop=True)

In [18]:
def tokenize_prompt(prompt):
    return tokenizer.encode(prompt, return_tensors="pt").cuda()

## Simple prompt

In [19]:
prompt = r"""
Does the following sentence provide information on presence of rain? Explain your reasoning.

Sentence: It is raining in London.
"""
input_ids = tokenize_prompt(prompt)

In [20]:
# Note: `temperature` only takes effect with `do_sample=True`;
# with sampling disabled, decoding is greedy.
sample_outputs = model.generate(
    input_ids,
    temperature=0.7,
    # do_sample=True,
    max_length=100,
    # top_k=50,
    # top_p=0.95,
    # num_return_sequences=3
)

for sample_output in sample_outputs:
    prediction = tokenizer.decode(sample_output, skip_special_tokens=True)
    print(f"{prompt=}")
    print("---------")
    print(f"prediction\n{prediction}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


prompt='\nDoes the following sentence provide information on presence of rain? Explain your reasoning.\n\nSentence: It is raining in London.\n'
---------
prediction

Does the following sentence provide information on presence of rain? Explain your reasoning.

Sentence: It is raining in London.

Answer: Yes, because it is raining in London.

Does the following sentence provide information on presence of rain? Explain your reasoning.

Sentence: It is raining in London.

Answer: No, because it is raining in London.

Does the following sentence provide information on presence of rain? Explain your reasoning.
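
The warning in the output above appears because only `input_ids` are passed to `generate`. A minimal sketch of one way to avoid it, assuming the `model` and `tokenizer` objects loaded earlier: calling the tokenizer directly returns both `input_ids` and `attention_mask`, which can be forwarded together with an explicit `pad_token_id`. The helper name `generate_with_attention_mask` is hypothetical.

```python
def generate_with_attention_mask(model, tokenizer, prompt, **generate_kwargs):
    """Generate text while passing the attention mask and pad token id explicitly.

    Hypothetical helper, not part of `a2` or `transformers`.
    """
    # Calling the tokenizer returns a mapping with `input_ids` and `attention_mask`.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move all tensors to the device the model reports as its own.
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    outputs = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs,
    )
    # Decode the first (and by default only) generated sequence.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

It could be called, for example, as `generate_with_attention_mask(model, tokenizer, prompt, max_new_tokens=60)`.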




## More complex instructions

In [21]:
prompt = r"""
Read below Tweets and tell me if they say that it is raining or sunny.
Format your answer in a human readable way,

Tweets:
Tweet 1: "The sound of rain tapping on the window" 
Tweet 2: "Boris likes drinking water". 
"""

example_output = """
Return the results in a json file like: [ 
{ "tweet": 1, "content": "The sound of rain tapping on the window", "explanation": "The sound of rain heard implies that is raining.", "score": 0.9 },  
{ "tweet": 2, "content": "Boris likes drinking water", "explanation": "The Tweet does not mention any information related to presence of rain or sun.", "score": 0.1},
{ "tweet": 3, "content": ... 
] 

Result: [ { "tweet": 1, "content":"""

tweets = ds_no_snow["text_normalized"][np.random.choice(np.arange(len(ds_no_snow["text_normalized"].values)), 5)].values
# ds_no_snow["text_normalized"].values[120:124]


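def string_tweets(tweets, first_index=3):
    """Number the Tweets for the prompt.

    The default of 3 assumes that the prompt already lists two example Tweets.
    """
    string = ""
    for i_tweet, t in enumerate(tweets, start=first_index):
        string += f'Tweet {i_tweet}: "{t}"\n'
    return string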

In [22]:
%%time


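def generate_prediction(args, tokenizer, model, prompt, tweets, example_output):
    """Build the full prompt, generate a completion and return the decoded text.

    `args` is currently unused and only kept for compatibility with existing calls.
    """
    full_prompt = prompt + string_tweets(tweets) + example_output
    input_ids = tokenize_prompt(full_prompt)

    # Note: `temperature` and `top_k` only take effect with `do_sample=True`;
    # with sampling disabled, decoding is greedy.
    sample_outputs = model.generate(
        input_ids,
        temperature=0.9,
        # do_sample=True,
        max_length=650,
        top_k=50,
        # top_p=0.95,
        # num_return_sequences=3
    )

    # Only the first generated sequence is decoded and returned.
    prediction = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
    print(f"{prompt=}")
    print("---------")
    print(f"prediction\n{prediction}")
    return prediction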

CPU times: user 12 µs, sys: 3 µs, total: 15 µs
Wall time: 31.9 µs


In [23]:
ds_no_snow = ds.where(~ds.text_normalized.str.contains("snow", flags=re.IGNORECASE), drop=True)

In [None]:
prompt = r"""
Read below Tweets and tell me if they say that it is raining or sunny. It should be rainy or sunny now.
Format your answer in a human readable way,

Tweets:
Tweet 1: "The sound of rain tapping on the window" 
Tweet 2: "Boris likes drinking water".
Tweet 3: "Rain is my imaginary love language, it rains always in my eyes"
"""

example_output = """
Return the results in a json file like: [ 
{ "tweet": 1, "content": "The sound of rain tapping on the window", "explanation": "The sound of rain heard implies that is raining.", "score": 0.9 },  
{ "tweet": 2, "content": "Boris likes drinking water", "explanation": "The Tweet does not mention any information related to presence of rain or sun.", "score": 0.1},
{ "tweet": 3, "content": ... 
] 

Result: [ { "tweet": 1, "content":"""
n_samples = 10
N_start = 5000
args = None
tweets = ds_no_snow["text_normalized"].values[slice(N_start, N_start + n_samples)]

for tweet_sample in np.array_split(tweets, len(tweets) // 5):
    prediction = generate_prediction(args, tokenizer, model, prompt, tweet_sample, example_output)
    with open("dump_relevance.csv", "a") as fd:
        fd.write(prediction)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


prompt='\nRead below Tweets and tell me if they say that it is raining or sunny. It should be rainy or sunny now.\nFormat your answer in a human readable way,\n\nTweets:\nTweet 1: "The sound of rain tapping on the window" \nTweet 2: "Boris likes drinking water". \n'
---------
prediction

Read below Tweets and tell me if they say that it is raining or sunny. It should be rainy or sunny now.
Format your answer in a human readable way,

Tweets:
Tweet 1: "The sound of rain tapping on the window" 
Tweet 2: "Boris likes drinking water". 
Tweet 3: "Loved everyones dedication coming to class in the rain this morning"
Tweet 4: "Loved last years city breaks Amsterdam and Rome but this year its time for a beach holiday with lots of sun sea and sand"
Tweet 5: "Loved noahsarkzoofarm today Cold but lovely sunny day to walk round and see the animals I was"
Tweet 6: "Loved seeing friends and fam this Xmas NYE break but so ready to get back to sunny Doha get in the gym and smash through January"
Tweet 
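
The generation loop in the cell above batches the Tweets with `np.array_split(tweets, len(tweets) // 5)`, which requires at least 5 Tweets because the number of sections must be larger than zero. A plain-Python sketch of fixed-size batching that also handles shorter or uneven inputs is shown below; the helper `batched` and the placeholder strings are illustrative assumptions.

```python
def batched(items, batch_size=5):
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


example_tweets = [f"tweet {i}" for i in range(12)]  # placeholder strings
batch_sizes = [len(batch) for batch in batched(example_tweets)]
print(batch_sizes)
```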

## Further ideas for prompts ...
Have a look at the following prompts and experiment with your own.
You may also vary the `temperature`: higher values make the model's replies more varied ("creative") if they are too "conservative".
Additionally, `top_k`, `top_p` and `num_return_sequences` may help your prompts "succeed". Note that `temperature`, `top_k` and `top_p` only take effect when `do_sample=True` is passed to `generate`.
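
To build some intuition for these parameters, the following conceptual sketch applies temperature scaling and top-k filtering to a toy set of three token scores. It is not the library implementation, and the logits are made-up values; it only illustrates that a low temperature sharpens the distribution, a high temperature flattens it, and top-k removes all but the `k` most likely tokens before sampling.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [logit / temperature for logit in logits]
    maximum = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - maximum) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def top_k_filter(probabilities, k):
    """Keep the k most probable tokens and renormalize; all others get probability 0."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    keep = set(order[:k])
    total = sum(probabilities[i] for i in keep)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probabilities)]


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probabilities = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])

print(top_k_filter(softmax_with_temperature(logits), k=2))
```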

In [None]:
prompt = r'Assign a probability that it is raining to the following tweets. The content of the tweets should hint something about the weather being rainy and not good in general. Tweet 1: "Well that last rumble of thunder made the house shake, I wasn\'t scared for a couple of seconds" Tweet 2: "#viewfromthe office what a great morning @Grantham and District". Return the results in a json file like: [ { "tweet": 1, "content": "Well that last rumble of thunder made the house shake, I wasn\'t scared for a couple of seconds", "explanation": "short explanation of the rain probability based on content of this tweet", "rain_probability": x.x }, ... ] Result: [ { "tweet": 1, "content":'
prompt = r"""Assign a probability that it is raining to the following tweets. The content of the tweets should hint something about the weather being rainy and not good in general. 
Tweet 1: "The sound of rain tapping on the window" 
Tweet 2: "Boris seems desperate for the rain to finish". 
Return the results in a json file like: [ { "tweet": 1, "content": "The sound of rain tapping on the window", "explanation": "The sound of rain heard implies that is raining.", "rain_probability": 0.9 },  { "tweet": 2, "content": "Boris seems ... ] 
Result: [ { "tweet": 1, "content":"""
prompt = r"""
Assign a probability that it is raining or not rainy to the following tweets, where 1 corresponds to rainy conditions, and 0 to not rainy/sunny conditions. 
In addition to that, assign a score that quantifies the certainty of your assessment based on the content of the tweet, regardless of whether it is rainy or not. 
Thus, for example, if a tweet doesn't mention anything weather-related, that score should be 0. 
If the tweet mentions a sunny day, the probability for rain should be close to 0, and the certainty should be close to 1 as the tweet includes explicit information about the weather. 
The content of the tweets should hint something about the weather being rainy and not good in general. 
List all tweets even if the scores and probabilities are 0. In addition, provide a brief explanation of your assessments for each tweet.

Format your answer in a human readable way,

Tweets:
Tweet 1: "The sound of rain tapping on the window" 
Tweet 2: "Boris seems desperate for the rain to finish". 
"""

## Reduce outputs

In [None]:
# Use a context manager so that the file is closed after reading.
with open("/p/project/training2330/a2/data/bootcamp2023/relevance/dump_relevance_5000.csv", "r") as f:
    file = f.read()

In [None]:
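def extract_from_file(file):
    """Extract Tweet contents, explanations and scores from the raw LLM output."""
    # Tweets 1 and 2 are the examples given in the prompt, so only Tweets 3-9 are parsed.
    full_tweet_replies = re.findall(r'(\{ "tweet": [3-9].+\})', file)
    CONTENTS = []
    EXPLANATIONS = []
    SCORES = []
    for reply in full_tweet_replies:
        content = re.findall(r'"content": "(.+)", "expl', reply)
        explanation = re.findall(r'"explanation": "(.+)", "score', reply)
        score = re.findall(r'"score": ([0-9]+\.[0-9]+)\s?\}', reply)
        # Skip replies where any field is missing to keep the three arrays aligned.
        if not (content and explanation and score):
            continue
        CONTENTS.append(content[0])
        EXPLANATIONS.append(explanation[0])
        SCORES.append(score[0])

    return np.array(CONTENTS), np.array(EXPLANATIONS), np.array(SCORES, float)


def add_file_content(SCORES, EXPLANATION, CONTENTS):
    """Write the extracted values to the position of the matching Tweet in `ds`.

    Note: `contents`, `explanations` and `scores` are read from the global scope.
    """
    for c, e, s in zip(contents, explanations, scores):
        # Match on the normalized text; skip Tweets that are missing or ambiguous.
        index = np.where(ds.text_normalized.values == c)[0]
        if len(index) != 1:
            print(f"{index=}, couldn't match {c}.")
            print(f"{s=}")
            continue
        index = index[0]
        CONTENTS[index] = c
        EXPLANATION[index] = e
        SCORES[index] = s
    return SCORES, EXPLANATION, CONTENTS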
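
To illustrate what the regular expressions used for the extraction match, the sketch below applies them to a single reply line. The line is a hypothetical example in the format requested by the prompt, not actual model output.

```python
import re

# Hypothetical single line of LLM output in the format requested by the prompt.
reply = (
    '{ "tweet": 3, "content": "Loved the rain this morning", '
    '"explanation": "Rain is mentioned for this morning.", "score": 0.8 }'
)

# The full reply is only picked up for Tweet numbers 3-9.
full_tweet_replies = re.findall(r'(\{ "tweet": [3-9].+\})', reply)

content = re.findall(r'"content": "(.+)", "expl', reply)
explanation = re.findall(r'"explanation": "(.+)", "score', reply)
score = re.findall(r'"score": ([0-9]+\.[0-9]+)\s?\}', reply)

print(content, explanation, score)
```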

In [None]:
!ls /p/scratch/deepacf/unicore-jobs/*/dump_relevance.csv

In [None]:
RELEVANCE_FILES = glob.glob("/p/scratch/deepacf/unicore-jobs/*/dump_relevance.csv")
print(f"{RELEVANCE_FILES=}")
# A score of -1 marks Tweets for which no LLM reply could be matched.
SCORES = np.ones_like(ds.index.values, dtype=float) * -1
EXPLANATION = np.empty_like(ds.index.values, dtype=object)
CONTENTS = np.empty_like(ds.index.values, dtype=object)

for filename in RELEVANCE_FILES:
    with open(filename, "r") as f:
        file = f.read()
    contents, explanations, scores = extract_from_file(file)
    SCORES, EXPLANATION, CONTENTS = add_file_content(SCORES, EXPLANATION, CONTENTS)
ds["relevance_score"] = (["index"], SCORES)
ds["explanation"] = (["index"], EXPLANATION)
ds["contents"] = (["index"], CONTENTS)

In [None]:
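import matplotlib.pyplot as plt

# Note: `scores` holds the values extracted from the last file read above.
plt.hist(scores, bins=100);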

In [None]:
ds["relevance_score"].plot.hist()

In [None]:
ds.to_netcdf(
    "/p/project/training2330/a2/data/bootcamp2023/relevance/tweets_2017_01_era5_normed_filtered_relevance_score.nc"
)

In [None]:
ds.where(ds.relevance_score"] 