# LLM Stance Detection
This notebook contains code for a first attempt at stance labeling with an LLM using Chain of Stance (CoS) prompting.

## Imports
Necessary imports.

In [None]:
# install/update relevant packages
# !pip install --upgrade transformers accelerate torch --quiet

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline # setting up our LLM
import torch # setting up our LLM
from huggingface_hub import login # authentication
import re # stance extraction
import pandas as pd
from tqdm import tqdm
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import pyarrow.parquet as pq
import pyarrow as pa
import gc
import glob

## Model Setup

### Choice of LLM
The model used is [Ministral-8B-Instruct-2410](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410), chosen for three reasons:
1. Mistral's 7B model outperformed similarly sized Qwen1.5 and LLaMA3 models in leading [Chain of Stance research](https://arxiv.org/pdf/2408.04649).
2. At the time of writing, it is Mistral's newest and most capable model under 10B parameters.
3. It supports a 128k context window, making it better suited to long-form and recent data.
   
**Note**: To get access, you must accept the Mistral AI Research License, which limits use of the model to research and non-commercial purposes.

### Authentication
1. Sign up for a Hugging Face account: [https://huggingface.co/join](https://huggingface.co/join)
2. Request access to [Ministral-8B-Instruct-2410](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410)
3. Insert your own access token below (replace `"hf____"` with your actual token):

In [None]:
# authentication to use model
ACCESS_TOKEN = "hf____"

login(ACCESS_TOKEN)
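
A hard-coded token is easy to commit by accident. As a minimal sketch, the token could instead be read from an environment variable. The helper name `read_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how your environment is configured.

```python
import os

def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in `var_name`, or raise if it is missing.

    `read_hf_token` and the default `var_name` are illustrative choices,
    not part of the original notebook.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token.")
    return token

# Usage in the notebook (commented out so the sketch runs without credentials):
# login(read_hf_token())
```

This keeps the secret out of the notebook file itself, so the notebook can be shared without editing the authentication cell.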

### Load Model and Tokenizer

In [None]:
model_id = "mistralai/Ministral-8B-Instruct-2410"

# Load tokenizer and model (uses GPU if available)
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

### Creating Pipeline
Chaining our model and tokenizer together in a pipeline.

In [None]:
chatbot = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding makes the output deterministic for reproducibility
)

## Chain of Stance Implementation

### Prompting
Function `create_cos_prompt` to generate our Chain of Stance prompt.

Per the Chain of Stance framework, the prompting encourages the model to sequentially consider multiple aspects of the text before making a stance judgement.

Inputs to the function are `text`, `topic`, `source_name` and `title`.

We optionally allow `example` to be input. This should be a dictionary containing an example text snippet, along with the desired LLM response.

In [None]:
def create_cos_prompt(text, topic, source_name, title, example=None):
    # this prompt structure guides the LLM through the 6 steps of CoS.
    # the stances are FAVOR, AGAINST, NONE.
    # an example can also be provided for few-shot prompting
    # keys must be 'text', 'topic', 'source_name', 'title' for a given example, and 'answer', the ideal LLM response
    
    main_prompt = f"""You are an expert in stance detection. 
                 Your task is to determine the stance of a given text towards a specific topic. 
                 Follow these steps carefully to provide a complete analysis and a final conclusion.

                **Source Name:** "{source_name}"
                **Title:** "{title}"
                **Text for Analysis:** "{text}"
                **Topic:** "{topic}"
                
                **Step 1: Contextual Information Analysis**
                Analyze the contextual information of the text. 
                Consider the topic, the likely identity of the author, the target audience, and any relevant socio-cultural background.
                
                **Step 2: Main Idea and Viewpoint Identification**
                Based on the text and context, what are the core viewpoints and main intentions being expressed regarding the topic?
                
                **Step 3: Language and Emotional Attitude Analysis**
                Analyze the language, tone, and emotion. 
                Identify emotive words, rhetorical devices, and the author's overall tone (e.g., affirmative, negative, neutral, sarcastic).
                
                **Step 4: Comparison with Possible Stances**
                Compare the text's content and tone against the three possible stances (FAVOR, AGAINST, NONE). 
                For each stance, list evidence from the source (if any) of that stance.
                
                **Step 5: Logical Inference and Consistency Check**
                Synthesize your analysis from all previous steps to make a final decision on the most likely stance expressed in the text from (FAVOR, AGAINST, NONE).
                
                **Step 6: Final Stance Determination**
                 Output the final stance on a new line, in the format 'Final Stance: [STANCE]', where [STANCE] is one of FAVOR, AGAINST, or NONE.
                
                Begin your analysis now.
                """

    # If no example, this is the prompt
    if example is None:
        return "[TASK]\n" + main_prompt.strip() + "\n[/TASK]"

    # Otherwise, format in the example
    example_prompt = f"""You are an expert in stance detection. 
                 Your task is to determine the stance of a given text towards a specific topic. 
                 Follow these steps carefully to provide a complete analysis and a final conclusion.

                **Source Name:** "{example["source_name"]}"
                **Title:** "{example["title"]}"
                **Text for Analysis:** "{example["text"]}"
                **Topic:** "{example["topic"]}"
                
                **Step 1: Contextual Information Analysis**
                Analyze the contextual information of the text. 
                Consider the topic, the likely identity of the author, the target audience, and any relevant socio-cultural background.
                
                **Step 2: Main Idea and Viewpoint Identification**
                Based on the text and context, what are the core viewpoints and main intentions being expressed regarding the topic?
                
                **Step 3: Language and Emotional Attitude Analysis**
                Analyze the language, tone, and emotion. 
                Identify emotive words, rhetorical devices, and the author's overall tone (e.g., affirmative, negative, neutral, sarcastic).
                
                **Step 4: Comparison with Possible Stances**
                Compare the text's content and tone against the three possible stances (FAVOR, AGAINST, NONE). 
                For each stance, list evidence from the source (if any) of that stance.
                
                **Step 5: Logical Inference and Consistency Check**
                Synthesize your analysis from all previous steps to make a final decision on the most likely stance expressed in the text from (FAVOR, AGAINST, NONE).
                
                **Step 6: Final Stance Determination**
                 Output the final stance on a new line, in the format 'Final Stance: [STANCE]', where [STANCE] is one of FAVOR, AGAINST, or NONE.
                
                Begin your analysis now.
                """

    # Compose the final few-shot prompt: worked example first, then the actual task
    prompt = (
        "[EXAMPLE]\n"
        + "**Example Prompt**\n"
        + example_prompt.strip()
        + "\n**Model Response**\n"
        + example["answer"].strip()
        + "\n[/EXAMPLE]"
        + "\n\n[TASK]\n"
        + main_prompt.strip()
        + "\n[/TASK]"
    )
    return prompt
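
The task prompt and the example prompt in `create_cos_prompt` use the same template with different field values. As a sketch of one possible refactor, a single template can be rendered twice. The names `COS_TEMPLATE`, `render_cos_body`, and `build_prompt` are hypothetical, and the template body is abbreviated here; the real one would carry the full six steps.

```python
from textwrap import dedent

# Abbreviated stand-in for the six-step template; the real step text is longer.
COS_TEMPLATE = dedent("""\
    You are an expert in stance detection.

    **Source Name:** "{source_name}"
    **Title:** "{title}"
    **Text for Analysis:** "{text}"
    **Topic:** "{topic}"

    (Steps 1-6 go here.)

    Begin your analysis now.""")

def render_cos_body(text, topic, source_name, title):
    """Fill the shared template with one document's fields."""
    return COS_TEMPLATE.format(text=text, topic=topic, source_name=source_name, title=title)

def build_prompt(text, topic, source_name, title, example=None):
    """Return the [TASK] block, preceded by an [EXAMPLE] block when an example is given."""
    task = "[TASK]\n" + render_cos_body(text, topic, source_name, title) + "\n[/TASK]"
    if example is None:
        return task
    shot = render_cos_body(
        example["text"], example["topic"], example["source_name"], example["title"]
    )
    return (
        "[EXAMPLE]\n**Example Prompt**\n" + shot
        + "\n**Model Response**\n" + example["answer"].strip()
        + "\n[/EXAMPLE]\n\n" + task
    )
```

With one template, any change to the step wording is applied to the example and the task at once, so the two cannot drift apart.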

### Few-Shot Prompting Example
This contains sample text paired with optimal model output, all sourced from Figure 2 of [Chain of Stance: Stance Detection with Large Language Models](https://arxiv.org/pdf/2408.04649) (Ma et al.). This can optionally be provided to the model for better results.

In [None]:
# sample text
example_text = "I love the smell of Hillary in the morning. It smells like Republican Victory. #TedCruz2016 #Marco Rubio"
example_topic = "Hillary Clinton"
example_source = "Twitter"
example_title = "Hillary Clinton Tweet"

# sample response
example_answer = """
**Recognizing Context**
Text Topic: The tweet mentions Hillary Clinton and her political competitors. Specifically, it references Ted Cruz and Marco Rubio, and includes the hashtag #SemST.

Author's Identity: The specific identity of the author is not directly indicated, but it can be inferred that the author is likely a political commentator or an ordinary user with certain political inclinations.

Target Audience: The target audience of the tweet is likely those who follow American politics, especially the internal competition within the Republican Party.

Relevant Socio-Cultural Background: The tweet was posted during the U.S. presidential election period, where Hillary Clinton was a major candidate for the Democratic Party, while Ted Cruz and Marco Rubio were major candidates for the Republican Party.

**Analyzing Main Idea**
Core viewpoint: The text expresses negative feelings towards Hillary Clinton and is optimistic about a Republican victory. 

Main Intention: To express support for the Republican candidates (like Ted Cruz and Marco Rubio) by mocking Hillary Clinton.

**Emotional Analysis**
Language Expression: "I love the smell of Hillary in the morning. It smells like Republican Victory." The phrase uses sarcasm. The word "love" is usually positive, but here it is clearly sarcastic.

Rhetorical Strategy: The author adapts a famous line from the movie "Apocalypse Now" ("I love the smell of napalm in the morning"), changing it to "I love the smell of Hillary in the morning" to mock Hillary Clinton and suggest that her presence will lead to a Republican victory. 

Tone: The overall tone is clearly sarcastic and derogatory.

Emotion: The underlying emotional inclination in the tweet is dissatisfaction with Hillary Clinton and a hope for a Republican victory.

**Stance Reinforcement**
Comparing neutral, favor, and against stances:
Favor Hillary Clinton: There are no signs of support for her in the tweet.
Against Hillary Clinton: The tweet mocks and belittles her, indicating a clear opposition to her.
None: The tweet expresses clear emotions and cannot be neutral.

**Logical Inference**
Combining the contextual information, main idea, emotional attitude and stance reinforcement, the tweet is clearly belittling Hillary Clinton and expressing support for Republican candidates like Ted Cruz and Marco Rubio.

**Stance Determination**
Final Stance: [AGAINST]
""".strip()

example_dict = {"source_name":example_source, "title":example_title, "text":example_text, "topic":example_topic, "answer":example_answer}
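
Because `create_cos_prompt` indexes the example by fixed key names, a misspelled key only surfaces as a `KeyError` during prompting. A small check up front can catch that earlier. This is a sketch; `REQUIRED_EXAMPLE_KEYS` and `validate_example` are hypothetical names, not part of the original notebook.

```python
# Keys that create_cos_prompt reads from a few-shot example.
REQUIRED_EXAMPLE_KEYS = ("text", "topic", "source_name", "title", "answer")

def validate_example(example):
    """Raise if the example lacks a required key or holds a blank value; otherwise return it."""
    missing = [key for key in REQUIRED_EXAMPLE_KEYS if key not in example]
    if missing:
        raise KeyError(f"example is missing keys: {missing}")
    blank = [key for key in REQUIRED_EXAMPLE_KEYS if not str(example[key]).strip()]
    if blank:
        raise ValueError(f"example has blank values for: {blank}")
    return example

# Usage in the notebook (commented out because example_dict is defined elsewhere):
# example_dict = validate_example(example_dict)
```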

### Inference
Function `get_stance_with_cos` combines our CoS prompt with our chatbot pipeline to generate a stance on a given document.

Inputs are the text, topic, source_name, title, example (optional), and model pipeline.

Outputs are two values:

1. The `stance`: one of "FAVOR", "AGAINST", or "NONE", or `None` if no stance could be parsed from the response.
2. The full six-step `reasoning_text`, kept for interpretability.

In [None]:
def get_stance_with_cos(text, topic, source_name, title, chatbot_pipeline, example=None):
    """
    Uses CoS method to get a stance from the LLM, with optional few-shot example.
    """
    # create the full prompt using CoS template
    cos_prompt = create_cos_prompt(text, topic, source_name, title, example=example)
    
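    # wrap the prompt in Mistral's instruction tags
    formatted_prompt = f"[INST]{cos_prompt}[/INST]"

    # the pipeline returns the prompt followed by the completion
    response = chatbot_pipeline(formatted_prompt)[0]["generated_text"]

    # keep only the text generated after the closing instruction tag
    reasoning_text = response.split("[/INST]")[-1].strip()

    # \W* tolerates brackets or markdown around the label, e.g. "Final Stance: [AGAINST]"
    match = re.search(r"Final Stance:\W*(FAVOR|AGAINST|NONE)", reasoning_text, re.IGNORECASE)

    # None signals that no stance could be parsed from the response
    final_stance = match.group(1).upper() if match else None

    return final_stance, reasoning_text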
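
The extraction step can be exercised on its own, without loading the model. The sketch below assumes the response may wrap the label in brackets or markdown emphasis, since the few-shot example ends with `Final Stance: [AGAINST]`. `STANCE_PATTERN` and `extract_stance` are hypothetical names, and taking the last match rather than the first is a design choice of this sketch, not the notebook's behavior.

```python
import re

# \W* skips spaces, brackets, and asterisks between the colon and the label;
# \b stops "NONE" from matching inside a longer word such as "NONETHELESS".
STANCE_PATTERN = re.compile(r"Final Stance:\W*(FAVOR|AGAINST|NONE)\b", re.IGNORECASE)

def extract_stance(reasoning_text):
    """Return the last declared stance in upper case, or None if none is found."""
    matches = STANCE_PATTERN.findall(reasoning_text)
    return matches[-1].upper() if matches else None

samples = [
    "Step 6\nFinal Stance: AGAINST",
    "Final Stance: [AGAINST]",
    "**Final Stance:** favor",
    "The text does not settle on a position.",
]
for sample in samples:
    print(repr(sample), "->", extract_stance(sample))
```

Running a handful of strings like these through the parser before a long batch job is a cheap way to confirm that unparsed responses come back as `None` rather than a wrong label.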

### Test Run
Single test run with sample data to test the model.

In [None]:
# text = "@realDonaldTrump You are not fooling anyone. You're scared, and overwhelmed, and you have absolutely no idea what you're doing. And it shows"
# topic = "Donald Trump"
# source_name = "NBC News"
# title = "Woman's Epic Anti-Trump Twitter Rant Goes Viral"

# stance, reasoning = get_stance_with_cos(text, topic, source_name, title, chatbot, example_dict)

# print("--- STANCE ---")
# print(stance)
# print("\n--- REASONING ---")
# print(reasoning)

## Loading Data
Loading our chunk data with topic labels.

In [None]:
chunks_w_topic_labels = pd.read_csv("data/chunks_for_stance_detection.csv")

In [None]:
dict(chunks_w_topic_labels.dtypes)

In [None]:
topic_cols = [col for col in chunks_w_topic_labels.columns if col.startswith("topic_")]

In [None]:
# one empty stance column and one empty reasoning column per topic
topic_names = [col.replace("topic_", "") for col in topic_cols]
stance_cols = {f"stance_{name}": None for name in topic_names}
reasoning_cols = {f"reasoning_{name}": None for name in topic_names}

# a dict of scalar values is broadcast across the supplied index
new_cols_df = pd.DataFrame({**stance_cols, **reasoning_cols}, index=chunks_w_topic_labels.index)
chunks_w_topic_labels = pd.concat([chunks_w_topic_labels, new_cols_df], axis=1)

## Production

In [None]:
def batch_stance_detection_wide(df, chatbot_pipeline, batch_size=8, example_dict=None, output_path="results_batches/"):
    """
    Runs CoS stance detection for every (row, topic) pair in df, writing one parquet file per batch.
    """
    os.makedirs(output_path, exist_ok=True)
    topic_cols = [col for col in df.columns if col.startswith("topic_")]
    rows = df.to_dict("records")
    n = len(rows)
    # Track which batches were written to disk
    processed_batches = []
    for batch_num, batch_start in enumerate(tqdm(range(0, n, batch_size))):
        batch_rows = rows[batch_start:batch_start+batch_size]
        prompts = []
        meta = []
        for row_idx, row in enumerate(batch_rows, start=batch_start):
            for col in topic_cols:
                # a truthy topic flag means the chunk was labeled with this topic
                if row[col]:
                    # topic_key matches the stance_/reasoning_ column suffix;
                    # topic_name is the readable form shown to the model
                    topic_key = col.replace("topic_", "")
                    topic_name = topic_key.replace("_", " ")
                    prompt = create_cos_prompt(
                        row["text"],
                        topic_name,
                        row["source_name"],
                        row["title"],
                        example=example_dict
                    )
                    prompts.append(f"[INST]{prompt}[/INST]")
                    meta.append((row_idx, topic_key))
        if len(prompts) == 0:
            continue
        try:
            results = chatbot_pipeline(prompts)
        except Exception as e:
            print(f"Batch {batch_num} failed: {e}")
            continue
        batch_updates = []
        for (row_idx, topic_key), output in zip(meta, results):
            # for a list of prompts, the pipeline returns one list of generations per prompt
            generated = output[0] if isinstance(output, list) else output
            reasoning_text = generated["generated_text"].split("[/INST]")[-1].strip()
            # \W* tolerates brackets or markdown around the label, e.g. "Final Stance: [AGAINST]"
            match = re.search(r"Final Stance:\W*(FAVOR|AGAINST|NONE)", reasoning_text, re.IGNORECASE)
            stance = match.group(1).upper() if match else None
            batch_updates.append({
                "row_idx": row_idx,
                "topic_name": topic_key,
                f"stance_{topic_key}": stance,
                f"reasoning_{topic_key}": reasoning_text
            })
        # Convert to DataFrame and save as parquet
        batch_df = pd.DataFrame(batch_updates)
        batch_df.to_parquet(os.path.join(output_path, f"batch_{batch_num:05d}.parquet"))
        processed_batches.append(batch_num)
        # Free up memory before the next batch
        del batch_df, batch_updates, results
        gc.collect()
    # Return the numbers of the batches that were written
    return processed_batches
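
The function records which batches were written but does not skip them on a rerun. As a sketch of how a rerun could resume, the batch numbers already on disk can be recovered from the file names. `completed_batch_numbers` is a hypothetical helper, and skipping by batch number is only safe if the DataFrame order and `batch_size` are unchanged between runs.

```python
import glob
import os
import re

def completed_batch_numbers(output_path):
    """Return the set of batch numbers that already have a parquet file in output_path."""
    done = set()
    for path in glob.glob(os.path.join(output_path, "batch_*.parquet")):
        # accept only names of the form batch_<digits>.parquet
        match = re.fullmatch(r"batch_(\d+)\.parquet", os.path.basename(path))
        if match:
            done.add(int(match.group(1)))
    return done

# Possible use inside the batch loop (commented out; it depends on the notebook's loop variables):
# done = completed_batch_numbers(output_path)
# if batch_num in done:
#     continue
```

Batches that failed or produced no prompts leave no file behind, so under this scheme they would simply be attempted again.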

In [None]:
batch_size = 8  # or 16 if fits in VRAM

# Run the batch stance detection with incremental writing
processed_batches = batch_stance_detection_wide(
    chunks_w_topic_labels,
    chatbot_pipeline=chatbot,
    batch_size=batch_size,
    example_dict=example_dict,
    output_path="results_batches/"
)

In [None]:
# Read all batch parquet files and concatenate
batch_files = sorted(glob.glob("results_batches/batch_*.parquet"))
all_batches = [pd.read_parquet(f) for f in batch_files]
final_results = pd.concat(all_batches, ignore_index=True)

# Merge results back into the original DataFrame by row index
for idx, row in final_results.iterrows():
    r_idx = row["row_idx"]
    tname = row["topic_name"]
    # stance_/reasoning_ columns in chunks_w_topic_labels use underscores, so normalize any spaces
    col_suffix = tname.replace(" ", "_")
    chunks_w_topic_labels.at[r_idx, f"stance_{col_suffix}"] = row[f"stance_{tname}"]
    chunks_w_topic_labels.at[r_idx, f"reasoning_{col_suffix}"] = row[f"reasoning_{tname}"]

# Save the final DataFrame
chunks_w_topic_labels.to_parquet("chunks_with_stances_final.parquet")