# Generating LLM Perspectives from the GlobalOpinionQA Dataset
The paper [Towards Measuring the Representation of Subjective Global Opinions in Language Models](https://huggingface.co/datasets/Anthropic/llm_global_opinions) introduces the GlobalOpinionQA dataset, which contains survey questions about global issues together with the aggregated responses of participants from many countries.

To convert the ordinal multiple-choice answers into free-text answers, we will use LLMs to generate perspectives based on the aggregate statistics in the dataset and save the result as a new dataset.
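
As a rough illustration of what the aggregate statistics look like once they are made readable, the sketch below pairs each answer option with the share of respondents who chose it and ranks the options by popularity. The helper name `summarize_selections`, the country "Exampleland", and the numbers are all hypothetical; they are not taken from the dataset.

```python
def summarize_selections(options, selections, country):
    """Pair each answer option with the share of respondents who chose it."""
    shares = selections[country]
    if len(shares) != len(options):
        raise ValueError("options and shares must have the same length")
    # Sort from most to least popular so the dominant view comes first
    ranked = sorted(zip(options, shares), key=lambda pair: pair[1], reverse=True)
    return [f"{option}: {share:.0%}" for option, share in ranked]


# Hypothetical data in the same shape as the `options` and `selections` columns
options = ["Agree", "Disagree", "Don't know"]
selections = {"Exampleland": [0.25, 0.7, 0.05]}

summary = summarize_selections(options, selections, "Exampleland")
print(summary)
```

A summary like this could be added to a prompt if the generated perspectives should reflect how common each view is, although that is a design choice rather than something the dataset requires.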

In [1]:
import os

import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
TEMP_PATH = os.getenv('TEMP_PATH')
DATA_PATH = os.getenv('DATA_PATH')

In [2]:
from datasets import load_dataset

ds = load_dataset("Anthropic/llm_global_opinions")
df = ds['train'].to_pandas()
df.shape

(2556, 4)

In [3]:
df.head()

Unnamed: 0,question,selections,options,source
0,When it comes to Germany’s decision-making in ...,"defaultdict(<class 'list'>, {'Belgium': [0.21,...","['Has too much influence', 'Has too little inf...",GAS
1,"Please tell me if you have a very favorable, s...","defaultdict(<class 'list'>, {'Sweden': [0.06, ...","['Very favorable', 'Somewhat favorable', 'Some...",GAS
2,Which statement comes closer to your own views...,"defaultdict(<class 'list'>, {'Australia': [0.0...",['Using overwhelming military force is the bes...,GAS
3,Do you think China will replace the U.S. as th...,"defaultdict(<class 'list'>, {'China (Non-natio...","['Next 10 years', 'Next 20 years', 'Next 50 ye...",GAS
4,"In your opinion, how strong a sense of Islamic...","defaultdict(<class 'list'>, {'Britain': [0.348...","['Very strong', 'Fairly strong', 'Not too stro...",GAS


In [4]:
import ast

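# Convert the string representation of a defaultdict to a regular dict
def convert_defaultdict_str(s):
    try:
        # Strip the "defaultdict(<class 'list'>, " wrapper and the closing parenthesis
        dict_str = s.split("defaultdict(<class 'list'>, ")[1][:-1]
        # Safely parse the remaining dictionary literal
        return ast.literal_eval(dict_str)
    except (AttributeError, IndexError, TypeError, ValueError, SyntaxError):
        # Fall back to an empty dict for rows that do not match the expected format
        return {}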

df['selections'] = df['selections'].apply(convert_defaultdict_str)
df['options'] = df['options'].apply(ast.literal_eval)
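
The conversion above returns an empty dict when a row cannot be parsed, which can hide malformed rows. A stricter variant, sketched below under the assumption that every value starts with the exact wrapper `defaultdict(<class 'list'>, `, raises instead. The helper name `parse_defaultdict_repr` is hypothetical; the sample string matches the Sweden row shown in the preview.

```python
import ast

PREFIX = "defaultdict(<class 'list'>, "


def parse_defaultdict_repr(s):
    """Parse the repr of a defaultdict(list) into a plain dict (hypothetical helper)."""
    if not (s.startswith(PREFIX) and s.endswith(")")):
        raise ValueError(f"unexpected format: {s[:40]!r}")
    # Drop the wrapper and the closing parenthesis, then parse the dict literal safely
    return ast.literal_eval(s[len(PREFIX):-1])


raw = "defaultdict(<class 'list'>, {'Sweden': [0.06, 0.4, 0.38, 0.13, 0.03]})"
parsed = parse_defaultdict_repr(raw)
print(parsed["Sweden"])
```

Raising on unexpected input makes it easier to count and inspect bad rows before any API calls are spent on them.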


In [5]:
df.head()


Unnamed: 0,question,selections,options,source
0,When it comes to Germany’s decision-making in ...,"{'Belgium': [0.21, 0.07, 0.69, 0.03], 'France'...","[Has too much influence, Has too little influe...",GAS
1,"Please tell me if you have a very favorable, s...","{'Sweden': [0.06, 0.4, 0.38, 0.13, 0.03]}","[Very favorable, Somewhat favorable, Somewhat ...",GAS
2,Which statement comes closer to your own views...,"{'Australia': [0.0, 0.1836734693877551, 0.0, 0...",[Using overwhelming military force is the best...,GAS
3,Do you think China will replace the U.S. as th...,{'China (Non-national sample)': [0.05714285714...,"[Next 10 years, Next 20 years, Next 50 years, ...",GAS
4,"In your opinion, how strong a sense of Islamic...","{'Britain': [0.34831460674157305, 0.5393258426...","[Very strong, Fairly strong, Not too strong, N...",GAS


## Structured Generation of LLM Perspectives

In [6]:
from pydantic import BaseModel
from openai import OpenAI
import json

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))


In [7]:
class CountryPerspective(BaseModel):
    option: str
    perspective: str

class CountryPerspectiveList(BaseModel):
    perspectives: list[CountryPerspective]


def generate_country_perspectives(question: str, options: list[str], country: str, max_retries: int = 1):
    """
    Generate perspectives for the given country explaining their choice of options.
    """
    system_prompt = """You are a helpful assistant that generates multiple perspectives of answers to a question. You will be given a question, a country, and a list of answer options, and you will generate a list of possible answer perspectives from the perspective of people from that country. Make sure you cover all possible perspectives but do not repeat yourself.
    """

    prompt = f"""Question: {question}
    Country: {country}
    Options: {', '.join(options)}
    Now, step by step, outline each broad answer perspective to this question from the perspective of a person from the country and each of the answer options.
    """
    retries = 0
    while retries < max_retries:
        try:
            chat_response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt}
                ],
                temperature=1,
                response_format={
                    'type': 'json_schema',
                    'json_schema': {
                        "name": "CountryPerspectiveList",
                        "schema": CountryPerspectiveList.model_json_schema()
                    }
                }
            )

            result_object = json.loads(chat_response.choices[0].message.content)
            # Validate the response using Pydantic
            validated_response = CountryPerspectiveList.model_validate(result_object)
            return [p.perspective for p in validated_response.perspectives]
        except Exception as e:
            retries += 1
            if retries == max_retries:
                print(f"Error for {country} - {question} (after {retries} retries): {str(e)}")
                return [f"Error generating perspective: {str(e)}"]
            print(f"Attempt {retries} failed, retrying...")
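
The loop above retries immediately after a failure. For transient errors such as rate limits, waiting between attempts usually helps. The sketch below shows one common pattern, exponential backoff, using only the standard library. The helper `call_with_retries` and its parameters are hypothetical, and `flaky` is a stand-in for the API call so the example runs offline.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again.

    Hypothetical helper: the name and parameters are illustrative.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for the API call: fails twice, then succeeds
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return ["perspective text"]


delays = []  # record the requested waits instead of actually sleeping
result = call_with_retries(flaky, max_retries=3, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable; in real use the default `time.sleep` would apply.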
            

In [8]:
def generate_perspectives(df, total_length=None):
    if total_length is None or total_length > len(df):
        total_length = len(df)
    
    # Create a new dataframe with just the rows we need
    working_df = df.head(total_length).copy()
    
    perspectives = []
    for _, row in working_df.iterrows():
        countries = list(row['selections'].keys())
        options = [str(x) for x in row['options']]
        country_perspectives = {}
        for country in countries:
            country_perspectives[country] = generate_country_perspectives(
                row['question'],
                options, 
                country
            )
        perspectives.append(country_perspectives)
    
    # Only modify the rows we processed
    working_df['country_perspectives'] = perspectives
    return working_df

This is soooo slow, since every question and country pair is a separate, sequential API call. Let's get this parallelized!

In [9]:
from concurrent.futures import ThreadPoolExecutor
from tqdm.auto import tqdm

# Split the first `total_rows` rows into chunks of `chunk_size` rows each
chunk_size = 15
total_rows = 150
subset = df.head(total_rows)
chunks = [subset.iloc[i:i + chunk_size] for i in range(0, len(subset), chunk_size)]

# Function to process each chunk
def process_chunk(chunk_df):
    return generate_perspectives(chunk_df, total_length=len(chunk_df))


# Process chunks in parallel using ThreadPoolExecutor with a single progress bar
with ThreadPoolExecutor(max_workers=10) as executor:
    perspectives_chunks = list(tqdm(
        executor.map(process_chunk, chunks),
        total=len(chunks),
        desc="Processing chunks"
    ))

# Combine results
df_with_perspectives = pd.concat(perspectives_chunks, ignore_index=True)
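
The chunked approach above parallelizes across groups of rows, so one chunk with many countries can keep a worker busy while others sit idle. A finer-grained alternative is to submit one task per (row, country) pair. The sketch below illustrates the idea with a stub in place of the API call; `fake_generate`, `run_task`, and the sample rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_generate(question, options, country):
    """Stand-in for the API call so the sketch runs offline (hypothetical)."""
    return [f"{country} view on '{option}'" for option in options]


# Hypothetical rows shaped like the parsed dataframe records
rows = [
    {"question": "Q1", "options": ["Yes", "No"],
     "selections": {"A": [0.6, 0.4], "B": [0.3, 0.7]}},
    {"question": "Q2", "options": ["Agree", "Disagree"],
     "selections": {"A": [0.5, 0.5]}},
]

# Flatten to one task per (row index, country)
tasks = [(i, country) for i, row in enumerate(rows) for country in row["selections"]]


def run_task(task):
    i, country = task
    row = rows[i]
    return i, country, fake_generate(row["question"], row["options"], country)


results = [{} for _ in rows]
with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in task order, regardless of completion order
    for i, country, perspectives in executor.map(run_task, tasks):
        results[i][country] = perspectives

print(results)
```

Each entry of `results` has the same shape as one `country_perspectives` value, so it could be assigned to the dataframe column in the same way.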



In [99]:
# Take the first 5 questions and display their details
sample_df = df_with_perspectives.head()

# Display the results in a readable format by printing the question and each of the country perspectives
for index, row in sample_df.iterrows():
    print(row['question'])
    for country, perspectives in row['country_perspectives'].items():
        print(f"\t{country}: {'; '.join(perspectives)}")


When it comes to Germany’s decision-making in the European Union, do you think Germany has too much influence, has too little influence or has about the right amount of influence?
	Belgium: Many people in Belgium feel that Germany's dominant economy gives it a disproportionate amount of sway in EU decisions, often sidelining smaller nations, including Belgium.; Some believe that Germany's strong leadership in the EU can lead to policies that primarily benefit German interests, such as economic stability and trade, sometimes at the expense of other member states.; Conversely, there are those in Belgium who argue that while Germany is influential, it should wield even more power in the EU to ensure strong leadership, especially in times of crisis like economic downturns or global challenges.; A common viewpoint is that Germany plays an essential role in the EU, and its influence is balanced by the need for consensus among member states, ensuring that all voices are heard.; Some Belgians 

In [94]:
# To keep things separate and clean, we're going to save these to a different file.
df_with_perspectives.to_csv(os.path.join(DATA_PATH, f'GlobalOpinionQA_with_LLM_generated_perspectives_{total_rows}.csv'), index=False)
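
One thing to keep in mind when saving to CSV is that nested values such as the `country_perspectives` dict are written as strings and have to be parsed when the file is read back. A minimal sketch of one way to make that round trip explicit is to serialize the nested column as JSON. The record below is hypothetical, and an in-memory buffer stands in for the file.

```python
import csv
import io
import json

# Hypothetical record shaped like one row of the output dataframe
record = {"question": "Q1", "country_perspectives": {"A": ["first view", "second view"]}}

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["question", "country_perspectives"])
writer.writeheader()
# Serialize the nested dict as JSON so it can be parsed back unambiguously
writer.writerow({**record, "country_perspectives": json.dumps(record["country_perspectives"])})

buffer.seek(0)
restored = [
    {**row, "country_perspectives": json.loads(row["country_perspectives"])}
    for row in csv.DictReader(buffer)
]
print(restored)
```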