Approach and Implementation

Step 1: Import the state CSV into the program as a pandas DataFrame
Suppose we are given data for 325 million synthetic individuals. Grouping by every attribute gives:
age(19) x gender(2) x ethnicity(6) x income(16) x MSA(2) x state(51) = 372,096 total groups
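
The group count above can be double-checked with a few lines of Python. This is only a quick sketch: the dictionary name `category_counts` is made up for illustration, and the counts are copied from the line above.

```python
from math import prod

# Number of categories per attribute, copied from the formula above.
category_counts = {
    "age": 19,
    "gender": 2,
    "ethnicity": 6,
    "income": 16,
    "MSA": 2,
    "state": 51,
}

# Total number of demographic groups across all states.
total_groups = prod(category_counts.values())

# Number of groups inside a single state (everything except the state factor).
groups_per_state = total_groups // category_counts["state"]

print(total_groups)      # 372096
print(groups_per_state)  # 7296
```

So each state file can produce at most 7,296 groups, and therefore at most that many prompts.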

In [None]:
import pandas as pd
state_file = input("Enter the path for the state CSV: ")
vm_file = input("Enter the path for the variable mapping CSV: ")

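# Load the state data and the variable mapping (coded values -> readable labels).
df = pd.read_csv(state_file)
vm = pd.read_csv(vm_file)
desc_cols = []  # names of the human-readable columns added below

for column in df.columns:
    # Rows of the mapping that describe this column, if any.
    vm_subset = vm[vm['var_id'] == column]

    if not vm_subset.empty:
        # Translate each coded value into its label and store the result in a
        # new column named after the variable's description.
        value_dict = dict(zip(vm_subset['value_id'], vm_subset['var_values']))
        description = vm_subset['description'].iloc[0].upper()
        df[description] = df[column].map(value_dict)
        desc_cols.append(description)

# One group per combination of demographic labels and MSA flag.
group_keys = desc_cols + ['loc_msa']

grouped = df.groupby(group_keys)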


Step 2: Query the LLM for a Voting Probability
Wrap the prompt in a function, so that every time you want to send something to the LLM you call this one function.

In [None]:
def call_LLM(random_sample, question, choice1, choice2):
    # Build the prompt from one sampled person's description and the question.
    prompt = (
        f"If the person's info is: {random_sample}\n"
        f"What is the probability this person chooses {choice1} "
        f"when choosing between {choice1} and {choice2} for {question}? Only give the probability."
    )
    # The prompt is now ready to be sent to the LLM.
    print(prompt)
    # To feed the LLM's answer back in by hand, uncomment the two lines below;
    # until then this function returns None.
    # prob = float(input("What probability did the LLM return? "))
    # return prob
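
If you type or paste the LLM's answer back in, the reply may not always be a bare number such as 0.71. A small helper like the one below could turn the reply into a probability. This is a sketch only: `parse_probability` is a hypothetical name that does not appear in the original program, and the reply formats it handles ("0.71", "71%", "Probability: 0.71") are assumptions about what an LLM might return.

```python
import re

def parse_probability(reply):
    """Return the first number found in `reply` as a probability in [0, 1].

    Hypothetical helper: the accepted reply formats are assumptions.
    """
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no number found in reply: {reply!r}")
    value = float(match.group())
    # A percent sign right after the number means a percentage, e.g. "71%".
    if reply[match.end():].lstrip().startswith("%"):
        value /= 100
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"probability out of range in reply: {reply!r}")
    return value

print(parse_probability("0.71"))
print(parse_probability("71%"))
print(parse_probability("Probability: 0.71"))
```

Inside call_LLM you could then use something like `return parse_probability(input(...))` in place of the plain `float(...)` conversion, so that a badly formatted reply raises an error instead of silently producing a wrong number.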



Step 3: Randomly select one sample from each group/combination

In [None]:
import random
# A reusable function: call it once for each state you import.
def process_state(df, group_keys, desc_cols, state_name):
    question = input("Enter your question: ")
    choice1 = input("Enter first choice: ")
    choice2 = input("Enter second choice: ")
    for group_key, group_df in df.groupby(group_keys):
        # Randomly pick one row from this group (group_df).
        if not group_df.empty:
            random_row = group_df.sample(n=1).iloc[0]
            demographic_info = ', '.join([str(random_row[col]) for col in desc_cols])
            loc_desc = 'Small town' if random_row['loc_msa'] == 'S' else 'Large city'
            new_prompt = f"{demographic_info}, lives in {loc_desc}, in {state_name}"
            # Send this description to the LLM prompt function.
            prob = call_LLM(new_prompt, question, choice1, choice2)
            # Once call_LLM returns the probability predicted by the LLM,
            # multiply prob by len(group_df) to estimate how many samples in
            # this group vote for the first choice (e.g. Joe Biden).
    return


Try the program with variable_mapping.csv and AK.csv
The cell below combines Steps 1, 2, and 3 with small adjustments: process_state now receives the grouped data directly.

In [None]:
import pandas as pd
import random

def call_LLM(random_sample, question, choice1, choice2):
    # Build the prompt from one sampled person's description and the question.
    prompt = (
        f"If the person's info is: {random_sample}\n"
        f"What is the probability this person chooses {choice1} "
        f"when choosing between {choice1} and {choice2} for {question}? Only give the probability."
    )
    # The prompt is now ready to be sent to the LLM.
    print(prompt)
    # To feed the LLM's answer back in by hand, uncomment the two lines below;
    # until then this function returns None.
    # prob = float(input("What probability did the LLM return? "))
    # return prob

# A reusable function: call it once for each state you import.
def process_state(grouped, desc_cols, state_name):
    question = input("Enter your question: ")
    choice1 = input("Enter first choice: ")
    choice2 = input("Enter second choice: ")
    for group_key, group_df in grouped:
        # Randomly pick one row from this group (group_df)
        if not group_df.empty:
            random_row = group_df.sample(n=1).iloc[0]
            demographic_info = ', '.join([str(random_row[col]) for col in desc_cols])
            loc_desc = 'Small town' if random_row['loc_msa'] == 'S' else 'Large city'
            new_prompt = f"{demographic_info}, lives in {loc_desc}, in {state_name}"
            # Send this description to the LLM prompt function.
            prob = call_LLM(new_prompt, question, choice1, choice2)
            # Once call_LLM returns the probability predicted by the LLM,
            # multiply prob by len(group_df) to estimate how many samples in
            # this group vote for the first choice (e.g. Joe Biden).
    return
        
if __name__ == '__main__':
    state_file = input("Enter the path for the state CSV: ")
    vm_file = input("Enter the path for the variable mapping CSV: ")

    df = pd.read_csv(state_file)
    vm = pd.read_csv(vm_file)
    desc_cols = []

    for column in df.columns:
        vm_subset = vm[vm['var_id'] == column]

        if not vm_subset.empty:
            value_dict = dict(zip(vm_subset['value_id'], vm_subset['var_values']))
            description = vm_subset['description'].iloc[0].upper()
            df[description] = df[column].map(value_dict)
            desc_cols.append(description)
            
    group_keys = desc_cols + ['loc_msa']

    grouped = df.groupby(group_keys)

    process_state(grouped, desc_cols, 'Alaska')

If the person's info is: 18 to 19 years, Female, Population of one race: American Indian and Alaska Native, $10 000 to $14 999, lives in Large city, in Alaska
What is the probability this person chooses Joe Biden when choosing between Joe Biden and Donald Trump for Who will this person vote for in the 2024 U.S. Presidential Election? Only give the probability.
If the person's info is: 18 to 19 years, Female, Population of one race: American Indian and Alaska Native, $10 000 to $14 999, lives in Small town, in Alaska
What is the probability this person chooses Joe Biden when choosing between Joe Biden and Donald Trump for Who will this person vote for in the 2024 U.S. Presidential Election? Only give the probability.
If the person's info is: 18 to 19 years, Female, Population of one race: American Indian and Alaska Native, $100 000 to $124 999, lives in Large city, in Alaska
What is the probability this person chooses Joe Biden when choosing between Joe Biden and Donald Trump for Who wi

The output above is shortened. We can pick one of the generated prompts to check whether it works and to see what the LLM returns for it:
Chosen prompt:
If the person's info is: 18 to 19 years, Female, Population of one race: American Indian and Alaska Native, $10 000 to $14 999, lives in Large city, in Alaska
What is the probability this person chooses Joe Biden when choosing between Joe Biden and Donald Trump for Who will this person vote for in the 2024 U.S. Presidential Election? Only give the probability.

Open ChatGPT and paste in the prompt above. It returns:
0.71
Then feed this probability back into the program for the corresponding group (call_LLM can return it), and multiply it by the number of samples in the group to estimate how many people in the group vote for the first choice (Joe Biden).

For example, if this group has 3000 samples and the LLM returns 0.71 for Biden, then 3000 x 0.71 = 2130 vote for Biden,
so 3000 - 2130 = 870 vote for Trump.

Repeat these steps for every group in one state, then move on to the next state.
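
The arithmetic above can be sketched in code as follows. The function names `estimate_votes` and `total_for_state` are hypothetical, and the second group (1000 samples, probability 0.40) is made up purely to show the summation; only the first group's numbers come from the example above.

```python
def estimate_votes(group_size, prob_first):
    """Split one group's samples between the two choices."""
    if not 0.0 <= prob_first <= 1.0:
        raise ValueError(f"probability must be between 0 and 1, got {prob_first}")
    first = round(group_size * prob_first)
    return first, group_size - first

def total_for_state(groups):
    """Sum the estimates over (group_size, prob_first) pairs for one state."""
    total_first = 0
    total_second = 0
    for group_size, prob_first in groups:
        first, second = estimate_votes(group_size, prob_first)
        total_first += first
        total_second += second
    return total_first, total_second

# The worked example from the text: 3000 samples, probability 0.71.
print(estimate_votes(3000, 0.71))  # (2130, 870)

# Two groups: the one above plus a made-up group for illustration.
print(total_for_state([(3000, 0.71), (1000, 0.40)]))  # (2530, 1470)
```

Rounding each group to a whole number of voters is a choice made for this sketch. Keeping fractional estimates and rounding only the state total would work too.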


Metric:

To ensure the program works correctly, I checked that (1) prompts were generated for all groups, (2) all predicted probabilities were valid (0–1), (3) vote estimates were calculated as probability * group size, (4) total estimated votes were reasonable relative to the population, and (5) sample prompts were in the correct format.
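
Checks (1) to (4) can be automated. The sketch below assumes the results were collected into a list of dictionaries with the keys "prob", "size", and "votes"; that layout and the name `check_results` are assumptions for illustration, not part of the program above. Check (5) stays a manual review.

```python
import math

def check_results(results, expected_groups, population):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # (1) One result per group.
    if len(results) != expected_groups:
        problems.append(f"expected {expected_groups} groups, got {len(results)}")

    for i, r in enumerate(results):
        # (2) Probabilities must lie between 0 and 1.
        if not 0.0 <= r["prob"] <= 1.0:
            problems.append(f"group {i}: probability {r['prob']} outside [0, 1]")
        # (3) Votes must equal probability * group size (allowing for rounding).
        elif not math.isclose(r["votes"], r["prob"] * r["size"], abs_tol=0.5):
            problems.append(f"group {i}: votes {r['votes']} != prob * size")

    # (4) Total estimated votes cannot exceed the population.
    total_votes = sum(r["votes"] for r in results)
    if not 0 <= total_votes <= population:
        problems.append(f"total votes {total_votes} not within 0..{population}")

    return problems

# Made-up results for two groups, for illustration only.
sample_results = [
    {"prob": 0.71, "size": 3000, "votes": 2130},
    {"prob": 0.40, "size": 1000, "votes": 400},
]
print(check_results(sample_results, expected_groups=2, population=4000))  # []
```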