# Protest Classification Using an LLM
This notebook uses a large language model (LLM) to classify protests into predefined categories. The input data comes from ACLED. Each row represents one protest event; the most relevant column for classification is `notes`, which contains a free-text description of the protest.

## Overview of Approach
1. **LLM Family**: We use the OpenAI family of `GPT` models (`gpt-3.5-turbo` in this run).

2. **Design Classification Prompt and Assess Performance**: Using a manually curated training dataset with labeled protests, we experiment with various prompting strategies and evaluate performance. We also examine the impact of the number of few-shot examples used on results.

3. **Apply Classification to the Dataset**: Once the optimal prompting strategy and number of few-shot examples are determined, we apply the classification approach to the entire dataset. This involves using the refined prompt to categorize each protest event based on its description, ensuring consistency and accuracy across all entries.
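
The experiment in step 2 can be pictured as a sweep over the number of few-shot examples per category. The sketch below is a minimal illustration rather than the notebook's actual evaluation code: `pick_examples`, `sweep_num_examples`, `toy_classify`, and the toy data are all hypothetical, and a word-overlap stub stands in for the LLM call so the loop runs offline.

```python
import random


def pick_examples(train, k, seed=42):
    """Pick up to k labeled examples per category from (note, category) pairs."""
    rng = random.Random(seed)
    by_cat = {}
    for note, cat in train:
        by_cat.setdefault(cat, []).append(note)
    examples = []
    for cat, notes in by_cat.items():
        chosen = rng.sample(notes, min(k, len(notes)))
        examples.extend((n, cat) for n in chosen)
    return examples


def sweep_num_examples(train, test, classify, ks):
    """Return {k: accuracy} for each candidate number of examples per category."""
    results = {}
    for k in ks:
        examples = pick_examples(train, k)
        correct = sum(classify(note, examples) == cat for note, cat in test)
        results[k] = correct / len(test)
    return results


def toy_classify(note, examples):
    """Hypothetical stand-in for the LLM: return the category of the
    example that shares the most words with the note."""
    words = set(note.lower().split())
    best = max(examples, key=lambda ex: len(words & set(ex[0].lower().split())))
    return best[1]


# Invented data for illustration only (not ACLED records)
train = [
    ("workers protest unpaid wages", "Livelihood"),
    ("teachers demand salary increase", "Livelihood"),
    ("farmers protest water shortage", "Climate"),
    ("residents protest river pollution", "Climate"),
]
test = [
    ("nurses protest unpaid salary", "Livelihood"),
    ("villagers protest water pollution", "Climate"),
]
scores = sweep_num_examples(train, test, toy_classify, ks=[1, 2])
print(scores)
```

In the real setting, `classify` would wrap the LLM call, and each value of `k` would cost one full pass over the test set.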

## Limitations 
1. **Cost**: Running OpenAI models is expensive when the token count is high. This classification task cost around $100.
2. **Not utilizing all examples**: To reduce cost and processing time, we do not use all available labeled examples to perform the classification.
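
To see why the number of few-shot examples drives cost, note that the same examples are resent with every document. The sketch below is a rough back-of-the-envelope estimator. The function name `estimate_cost_usd`, the token counts, and the prices are all illustrative assumptions; they are not measured values or OpenAI's actual rates.

```python
def estimate_cost_usd(n_docs, prompt_tokens, completion_tokens,
                      usd_per_1k_prompt, usd_per_1k_completion):
    """Estimate total API cost when every document is sent with the same few-shot prompt."""
    per_doc = (
        (prompt_tokens / 1000) * usd_per_1k_prompt
        + (completion_tokens / 1000) * usd_per_1k_completion
    )
    return n_docs * per_doc


# All numbers below are illustrative assumptions, not measurements or real prices.
cost = estimate_cost_usd(
    n_docs=10_000,
    prompt_tokens=2_000,         # instructions + few-shot examples + one note
    completion_tokens=10,        # a short category label
    usd_per_1k_prompt=0.001,
    usd_per_1k_completion=0.002,
)
print(f"Estimated cost: ${cost:.2f}")
```

Under these assumptions the prompt tokens dominate, so the estimate scales almost linearly with prompt length and with the number of documents.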



In [1]:
import os
from pathlib import Path

import pandas as pd
import geopandas as gpd

from datetime import datetime

import bokeh
from bokeh.models import Tabs, TabPanel
from bokeh.core.validation.warnings import EMPTY_LAYOUT, MISSING_RENDERERS
from bokeh.plotting import show, output_notebook

from langchain_openai import OpenAI, ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="langchain")


from dotenv import load_dotenv


## Global variables 


In [2]:
# ==================
# SETUP INPUT
# ==================
DIR_DATA = Path.cwd().parents[1].joinpath("data", "conflict")
FILE_PROTESTS = DIR_DATA.joinpath("protests_iran_20160101_20241009.csv")
FILE_PROTESTS_CLASSIFIED = DIR_DATA.joinpath("protests_sample_acled_iran.csv")
FILE_PROTESTS_CLASSES = DIR_DATA.joinpath("protest_classification.csv")
FILE_PROTESTS_TRAINING = DIR_DATA.joinpath("protests-labeled-sample-training.csv")

# ==================
# CLASSIFICATION 
# ==================
PROP_TRAIN = 0.4
NUM_EXAMPLES = 10
SAMPLE_PROP = 0.5
OPENAI_MODEL = "gpt-3.5-turbo"

# For testing, classify only a portion of the documents
SAMPLE_SIZE = 0.3

## Preprocess Data

In [3]:
df = pd.read_csv(FILE_PROTESTS_TRAINING)

In [6]:
df.category.value_counts()

category
Livelihood (Prices, jobs and salaries)    64
Political/Security                        56
Business and legal                        42
Social                                    26
Public service delivery                   25
Climate and environment                   11
Name: count, dtype: int64

In [7]:
df_prot = pd.read_csv(FILE_PROTESTS, dtype=str)
df_prot_labels = pd.read_csv(FILE_PROTESTS_CLASSIFIED)
df_tmp2 = pd.read_csv(FILE_PROTESTS_CLASSES)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/dunstanmatekenya/Library/CloudStorage/OneDrive-WBG(2)/Data-Lab/iran-economic-monitoring/data/conflict/protest_classification.csv'

In [28]:
description_code = dict(df_tmp2[["code", "description"]].values)
category_code = dict(df_tmp2[["code", "major_category"]].values)

In [29]:
df_prot_labels.rename(columns={'Classification code': "code"}, inplace=True)

In [30]:
df_prot_labels['description'] = df_prot_labels.code.map(description_code)
df_prot_labels['category'] = df_prot_labels.code.map(category_code)

In [31]:
def pretty_print_value_counts(
    df, column, title=None, line_length=None, top_n=None, table_number=None
):
    """
    Pretty prints the value counts of a specified column in a Pandas DataFrame,
    with counts formatted with thousand separators, percentages, and cumulative percentages.

    Parameters:
    -----------
    df : pandas.DataFrame
        The DataFrame containing the data.
    column : str
        The name of the column for which to calculate value counts.
    title : str, optional
        A title to print above the formatted output. If None, no title is printed.
    line_length : int, optional
        The length of the separator line. If None, it will be determined based on
        the length of the title or default to 50 if no title is provided.
    top_n : int, optional
        The number of top categories to display. If None, all categories are displayed.
    table_number : int, optional
        The numeric value for the table number. If provided, the table number will be displayed as 'Table-X'.

    Returns:
    --------
    None
        Displays a styled DataFrame with counts, percentages, and cumulative percentages.
    """
    # Calculate the value counts and convert to DataFrame
    count_df = pd.DataFrame(df[column].value_counts(normalize=False).reset_index())
    count_df.columns = ["Category", "Count"]

    # Add a percentage column
    count_df["Percent"] = (count_df["Count"] / count_df["Count"].sum()) * 100

    # Add a cumulative percentage column
    count_df["Cum. Percent"] = count_df["Percent"].cumsum()

    # Limit the output to top_n categories if specified
    if top_n:
        count_df = count_df.head(top_n)

    # Print the table number if provided
    if table_number is not None:
        print(f"Table-{table_number}")

    # Determine the length of the line if line_length is not provided
    if title:
        if line_length is None:
            line_length = max(
                50, len(title) + 4
            )  # Ensure at least 50 characters, or more based on the title

        # Calculate padding to center the title
        total_padding = line_length - len(title)
        left_padding = total_padding // 2
        right_padding = total_padding - left_padding

        # Print the centered title with the "=" line
        print("=" * line_length)
        print(" " * left_padding + title + " " * right_padding)
        print("=" * line_length)

    # Display the styled DataFrame without index, formatting Count, Percent, and Cumulative Percent columns
    display(
        count_df.style.hide(axis="index").format(
            {
                "Count": "{:,.0f}",  # Thousand separator for Count
                "Percent": "{:.2f}%",  # Format Percent to 2 decimal places with a % symbol
                "Cum. Percent": "{:.2f}%",  # Format Cumulative Percent to 2 decimal places with a % symbol
            }
        )
    )

    # Print footer or separator
    print("-" * line_length)

In [60]:
pretty_print_value_counts(df_prot_labels, "category", 
title="Distribution of Protest Classes", line_length=60)

              Distribution of Protest Classes               


Category,Count,Percent,Cum. Percent
"Livelihood (Prices, jobs and salaries)",64,28.57%,28.57%
Political/Security,56,25.00%,53.57%
Business and legal,42,18.75%,72.32%
Social,26,11.61%,83.93%
Public service delivery,25,11.16%,95.09%
Climate and environment,11,4.91%,100.00%


------------------------------------------------------------


In [33]:
df_prot_labels.to_csv(DIR_DATA.joinpath("protest-with-labels.csv"), 
index=False)

In [None]:
## Check Zero Shot Classification Accuracy

## Check Classification Accuracy 

In [None]:
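from sklearn.metrics import accuracy_score


def gpt3_5_classification(dataset, categories, llm, prompt_template):
    """Classify each note with the LLM and return accuracy against the `category` column.

    `llm` and `prompt_template` are passed in explicitly because they are not
    defined at this point in the notebook.
    """
    predictions = []
    for note in dataset['notes']:
        # Format the prompt with the note and available categories
        prompt = prompt_template.format(categories=", ".join(categories), note=note)
        response = llm.invoke(prompt)
        # Chat models return a message object; completion models return a plain string
        text = getattr(response, "content", response)
        predictions.append(text.strip())

    dataset['gpt3_5_classification'] = predictions

    # Calculate accuracy
    accuracy = accuracy_score(dataset['category'], dataset['gpt3_5_classification'])
    return accuracy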

### Split the examples into training and test

In [34]:
# Perform a stratified split: sample a fraction PROP_TRAIN (40%) of each category for training and keep the rest for testing

# Split the data while maintaining the distribution of categories
train_df = df_prot_labels.groupby('category', group_keys=False).apply(lambda x: x.sample(frac=PROP_TRAIN, 
random_state=42))
test_df = df_prot_labels.drop(train_df.index)

(train_df.shape, test_df.shape)


((89, 6), (135, 6))
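
These sizes follow from sampling a fraction of 0.4 within each category. As a sanity check, the sketch below rebuilds a stratified split in plain Python from the category counts reported earlier. `stratified_split` is a hypothetical helper, and the rule that each per-category sample size is `round(frac * n)` is an assumption about how pandas' `sample(frac=...)` sizes its draw.

```python
import random
from collections import defaultdict


def stratified_split(labels, frac, seed=42):
    """Split row indices into train/test, sampling `frac` of each category for train."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for idx, cat in enumerate(labels):
        by_cat[cat].append(idx)
    train = []
    for cat, idxs in by_cat.items():
        n_train = round(frac * len(idxs))  # assumed rounding rule
        train.extend(rng.sample(idxs, n_train))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


# Category counts taken from the labeled sample in this notebook
counts = {
    "Livelihood (Prices, jobs and salaries)": 64,
    "Political/Security": 56,
    "Business and legal": 42,
    "Social": 26,
    "Public service delivery": 25,
    "Climate and environment": 11,
}
labels = [cat for cat, n in counts.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels, frac=0.4)
print(len(train_idx), len(test_idx))
```

Under that rounding assumption the totals agree with the 89 training rows and 135 test rows reported by the notebook.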

In [35]:
pretty_print_value_counts(train_df, "category", 
title="Train-Distribution of Protest Classes", line_length=60)

           Train-Distribution of Protest Classes            


Category,Count,Percent,Cum. Percent
"Livelihood (Prices, jobs and salaries)",26,29.21%,29.21%
Political/Security,22,24.72%,53.93%
Business and legal,17,19.10%,73.03%
Public service delivery,10,11.24%,84.27%
Social,10,11.24%,95.51%
Climate and environment,4,4.49%,100.00%


------------------------------------------------------------


In [36]:
pretty_print_value_counts(test_df, "category", 
title="Test-Distribution of Protest Classes", line_length=60)

            Test-Distribution of Protest Classes            


Category,Count,Percent,Cum. Percent
"Livelihood (Prices, jobs and salaries)",38,28.15%,28.15%
Political/Security,34,25.19%,53.33%
Business and legal,25,18.52%,71.85%
Social,16,11.85%,83.70%
Public service delivery,15,11.11%,94.81%
Climate and environment,7,5.19%,100.00%


------------------------------------------------------------


In [37]:
# Sample up to NUM_EXAMPLES examples from each category in the training set
examples = []
for category in train_df['category'].unique():
    # Take every row if the category has fewer than NUM_EXAMPLES, otherwise sample NUM_EXAMPLES
    category_rows = train_df[train_df['category'] == category]
    category_samples = category_rows.sample(
        min(NUM_EXAMPLES, len(category_rows)), random_state=42
    )
    examples.extend(category_samples[['notes', 'description', 'category']].values.tolist())

In [38]:
# Step 3: Define a prompt template for classification, adding detailed examples
classification_prompt_template = """
You are a highly intelligent assistant. Your task is to classify each document into one of the following categories:
- Political/Security
- Livelihood (Prices, jobs and salaries)
- Public service delivery
- Business and legal
- Climate and environment
- Social

Each category has a description that helps explain its purpose.

Here are some examples:

{examples}

Given the following document:
{document}

Based on the descriptions and the examples, which category does this document fall into? Please respond with one of the categories listed above.
"""

In [39]:
# Step 4: Format the examples for the prompt, including notes, descriptions, and categories
formatted_examples = "\n".join(
    [
        f"Example {i+1}:\nNotes: {example[0]}\nDescription: {example[1]}\nCategory: {example[2]}"
        for i, example in enumerate(examples)
    ]
)
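
To make the resulting prompt structure concrete, the sketch below formats two invented examples in the same way and fills a shortened template with `str.format`. The names `toy_examples`, `toy_formatted`, and `short_template` are hypothetical, and the example texts are made up rather than taken from the ACLED data.

```python
# Two invented examples in the same [notes, description, category] layout;
# these are illustrative and are not ACLED records.
toy_examples = [
    ["Workers gathered to demand unpaid wages.", "Wage arrears",
     "Livelihood (Prices, jobs and salaries)"],
    ["Residents protested repeated power cuts.", "Electricity outages",
     "Public service delivery"],
]

toy_formatted = "\n".join(
    f"Example {i + 1}:\nNotes: {ex[0]}\nDescription: {ex[1]}\nCategory: {ex[2]}"
    for i, ex in enumerate(toy_examples)
)

# A shortened stand-in for the full classification prompt template
short_template = (
    "Here are some examples:\n\n{examples}\n\n"
    "Given the following document:\n{document}\n"
)
prompt = short_template.format(
    examples=toy_formatted,
    document="Farmers protested water shortages.",
)
print(prompt)
```

Because the formatted examples are resent with every document, the prompt length grows with the number of examples per category.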

In [40]:

# Step 2: Initialize the LLM (OpenAI in this case)
llm = ChatOpenAI(model=OPENAI_MODEL, temperature=0.7)

# Step 5: Create a prompt template using Langchain
classification_prompt = PromptTemplate(
    input_variables=["document", "examples"],
    template=classification_prompt_template
)

# Step 6: Create an LLMChain that combines the LLM and the classification prompt template
classification_chain = LLMChain(
    llm=llm,
    prompt=classification_prompt
)

In [8]:
50000/2600

19.23076923076923

In [41]:
# Test with all the 'notes' column from test_df
documents = test_df['notes'].tolist()

# Classify each document and store results
classifications = []
for doc in documents:
    result = classification_chain.run(document=doc, examples=formatted_examples)
    # Clean up the result by removing the "Category: " prefix if it exists
    cleaned_result = result.strip().replace("Category: ", "")
    classifications.append(cleaned_result)  # Store the cleaned classification

# Add the cleaned classification results to the DataFrame as a new column
test_df['classification'] = classifications


# Calculate accuracy
correct_predictions = (test_df['classification'] == test_df['category']).sum()
total_predictions = len(test_df)
accuracy = correct_predictions / total_predictions

# Print the accuracy
print("="*55)
print(f" Accuracy using model: {OPENAI_MODEL} with {NUM_EXAMPLES} Examples")
print("="*55)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("-"*55)

 Accuracy using model: gpt-3.5-turbo with 10 Examples
Accuracy: 87.41%
-------------------------------------------------------
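
Overall accuracy can hide weak performance on small classes (for example, "Climate and environment" has only 7 test documents). A per-category breakdown is one way to check this. The sketch below is a minimal, hypothetical helper (`per_class_accuracy`) run on toy labels, not on the notebook's results.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {category: fraction of its documents predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cat: hits[cat] / n for cat, n in totals.items()}


# Toy labels for illustration only
y_true = ["Social", "Social", "Political/Security", "Climate and environment"]
y_pred = ["Social", "Political/Security", "Political/Security", "Climate and environment"]
print(per_class_accuracy(y_true, y_pred))
```

In the notebook this could be called with `test_df['category']` and `test_df['classification']`.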


## Classify All Documents 

In [None]:
# df_prot_sample = df_prot.sample(frac=SAMPLE_SIZE)
# for index, row in df_prot_sample.iterrows():
#     try:
#         # Run classification for the "notes" column of the current row
#         result = classification_chain.run(document=row['notes'], examples=formatted_examples)
#         # Clean up the result by removing the "Category: " prefix if it exists
#         cleaned_result = result.strip().replace("Category: ", "")
#         # Store the cleaned classification in the "classification" column of the current row
#         df_prot_sample.at[index, 'classification'] = cleaned_result
#     except Exception as e:
#         print(f"Error processing row {index}: {e}")
#         df_prot_sample.at[index, 'classification'] = None  # Optionally, mark as None if there was an error


In [68]:
def classify_with_retry(document, examples, max_retries=1):
    """
    Classifies the document, retrying if the initial classification is not in the specified categories.
    
    Parameters:
    - document (str): The document to classify.
    - examples (str): Few-shot examples, already formatted as a single string for the prompt.
    - max_retries (int): Number of retries allowed if classification is not in specified categories.

    Returns:
    - str: The final classification or 'Failed2Classify' if classification is unsuccessful.
    """
    for _ in range(max_retries + 1):
        result = classification_chain.run(document=document, examples=examples)
        cleaned_result = result.strip().replace("Category: ", "")
        
        if cleaned_result in categories:
            return cleaned_result  # Return if classification is in the categories
    
    # If classification failed after retries, label as "Failed2Classify"
    return "Failed2Classify"
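
The check above relies on an exact string match, so a reply such as `social.` counts as a failure and triggers a retry. A more tolerant normalization step could reduce retries. The sketch below is one possible approach and is not part of the notebook; `normalize_label` and `CATEGORIES` are hypothetical names, with the category list copied from the prompt.

```python
CATEGORIES = [
    "Political/Security",
    "Livelihood (Prices, jobs and salaries)",
    "Public service delivery",
    "Business and legal",
    "Climate and environment",
    "Social",
]


def normalize_label(raw, categories=CATEGORIES):
    """Map a raw model reply to a canonical category, or None if nothing matches."""
    text = raw.strip()
    # Drop an optional "Category:" prefix, regardless of case
    if text.lower().startswith("category:"):
        text = text[len("category:"):]
    # Remove surrounding whitespace, quotes, and trailing periods, then compare case-insensitively
    text = text.strip().strip(".\"'").strip().lower()
    for cat in categories:
        if text == cat.lower():
            return cat
    return None


print(normalize_label("Category: social."))
print(normalize_label("I am not sure"))
```

A `None` result would then be treated the same way as a reply outside the category list.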

In [69]:
# Classify and ignore rows already classified in df_prot_sample
# Also, classify when value in df_prot_sample == "Failed2Classify"
for index, row in df_prot.iterrows():
    # Check if this 'note' has already been classified in df_prot_sample
    sample_classification = df_prot_sample.loc[
        (df_prot_sample['event_date'] == row['event_date']) &
        (df_prot_sample['source'] == row['source']) &
        (df_prot_sample['admin1'] == row['admin1']) &
        (df_prot_sample['admin2'] == row['admin2']) &
        (df_prot_sample['admin3'] == row['admin3']) &
        (df_prot_sample['notes'] == row['notes'])
    ]['classification']

    if sample_classification.notnull().any():
        # Only reclassify if the current classification is "Failed2Classify"
        if "Failed2Classify" in sample_classification.values:
            # Run classification with retry mechanism
            try:
                classification = classify_with_retry(document=row['notes'], examples=formatted_examples)
                df_prot.at[index, 'classification'] = classification
            except Exception as e:
                print(f"Error processing row {index}: {e}")
                df_prot.at[index, 'classification'] = None  # Optionally, mark as None if there was an error
        else:
            continue  # Skip if already classified
    else:
        # Run classification if not in df_prot_sample
        try:
            classification = classify_with_retry(document=row['notes'], examples=formatted_examples)
            df_prot.at[index, 'classification'] = classification
        except Exception as e:
            print(f"Error processing row {index}: {e}")
            df_prot.at[index, 'classification'] = None
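
The loop above filters `df_prot_sample` once per row of `df_prot`, so the work grows with the product of the two table sizes. One alternative is to index the earlier classifications by key once and look each row up in constant time. The sketch below illustrates the idea with plain dictionaries standing in for rows. `row_key`, `build_lookup`, and `needs_classification` are hypothetical helpers, and the key fields mirror the columns compared in the loop.

```python
KEY_FIELDS = ("event_date", "source", "admin1", "admin2", "admin3", "notes")


def row_key(row):
    """Build a hashable key from the fields that identify a protest record."""
    return tuple(row[f] for f in KEY_FIELDS)


def build_lookup(sample_rows):
    """Index previously classified rows by key for constant-time lookups."""
    return {row_key(r): r.get("classification") for r in sample_rows}


def needs_classification(row, lookup):
    """True if the row is new, unlabeled, or previously failed to classify."""
    previous = lookup.get(row_key(row))
    return previous is None or previous == "Failed2Classify"


# Toy rows for illustration only
base = {"event_date": "2020-01-01", "source": "s",
        "admin1": "a", "admin2": "b", "admin3": "c"}
sample_rows = [
    {**base, "notes": "n1", "classification": "Social"},
    {**base, "notes": "n2", "classification": "Failed2Classify"},
]
lookup = build_lookup(sample_rows)
todo = [n for n in ("n1", "n2", "n3")
        if needs_classification({**base, "notes": n}, lookup)]
print(todo)
```

With real data, missing values in the key columns would likely need to be replaced by a sentinel before building keys, since NaN values do not compare equal to each other.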


In [77]:
# Merge the dataframes on all columns except "classification"
merged_df = df_prot.merge(
    df_prot_sample,
    on=['event_date', 'source', 'admin1', 'admin2', 'admin3', 'event_type', 'sub_event_type', 
        'interaction', 'fatalities', 'latitude', 'longitude', 'actor1', 'actor2', 'notes'],
    how='left',
    suffixes=('', '_sample')
)

# Keep the classification from df_prot where present; otherwise fall back to the value
# from df_prot_sample (which may itself be "Failed2Classify")
merged_df['classification'] = merged_df['classification'].fillna(
    merged_df['classification_sample']
)

# Drop the extra "classification_sample" column
merged_df.drop(columns=['classification_sample'], inplace=True)

# Drop duplicates based on all columns except "classification"
merged_df.drop_duplicates(
    subset=['event_date', 'source', 'admin1', 'admin2', 'admin3', 'event_type', 'sub_event_type', 
            'interaction', 'fatalities', 'latitude', 'longitude', 'actor1', 'actor2', 'notes'],
    inplace=True
)

# Reset index for a clean merged dataframe
merged_df.reset_index(drop=True, inplace=True)


In [79]:
merged_df.shape[0] == df_prot.shape[0]

True

In [78]:
merged_df.classification.value_counts(dropna=False)

classification
Livelihood (Prices, jobs and salaries)    13459
Business and legal                         3369
Social                                     3119
Political/Security                         3109
Public service delivery                     984
Climate and environment                     915
Name: count, dtype: int64

### Preprocess Results


In [54]:
# ===============================================
# REMOVE CATEGORIES NOT IN THE AVAILABLE CLASSES
# ===============================================
categories = list(category_code.values())
df_prot_sample['classification'] = df_prot_sample['classification'].apply(lambda x: x if x in categories else "Failed2Classify")

In [83]:
merged_df.drop(columns=["Unnamed: 0"], inplace=True)

In [85]:
merged_df.to_csv(DIR_DATA.joinpath("protests-labeled-gpt.csv"), index=False)

In [80]:
pretty_print_value_counts(merged_df, "classification", 
"Distribution of Labeled Protests",line_length=60 )

              Distribution of Labeled Protests              


Category,Count,Percent,Cum. Percent
"Livelihood (Prices, jobs and salaries)",13459,53.93%,53.93%
Business and legal,3369,13.50%,67.43%
Social,3119,12.50%,79.93%
Political/Security,3109,12.46%,92.39%
Public service delivery,984,3.94%,96.33%
Climate and environment,915,3.67%,100.00%


------------------------------------------------------------


In [5]:
df = pd.read_csv(DIR_DATA.joinpath("protests-labeled-all-gpt.csv"))