# Long-Context Gemini: Unlocking Psychohistory

## What is Psychohistory
- Psychohistory is a fictional discipline introduced by science fiction author Isaac Asimov in the Foundation series. It combines history, sociology, statistics, and mathematics to analyze the behavior of large groups of individuals and predict trends in collective behavior and the future development of society. 
- While it cannot predict the actions of a single individual, the randomness of individual behaviors cancels out in sufficiently large groups, resulting in predictable patterns of overall societal behavior. 
- Psychohistory aims to provide a scientific basis for policymaking, forecasting social change, and managing crises. It does so by using mathematical models and large-scale data analysis to predict the future of societies probabilistically.
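
The claim that individual randomness cancels out in large groups can be illustrated with a small simulation. This is only a sketch and not part of the project: it assumes every individual makes an independent yes/no choice with a hypothetical probability `p`, and the name `group_share_spread` is made up for this illustration.

```python
import random
import statistics

def group_share_spread(group_size, p=0.3, trials=200, seed=0):
    """Spread (standard deviation) of the 'yes' share across repeated trials.

    Assumption: each individual decides independently with probability p.
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(trials):
        yes = sum(1 for _ in range(group_size) if rng.random() < p)
        shares.append(yes / group_size)
    return statistics.pstdev(shares)

small = group_share_spread(10)
large = group_share_spread(10_000)
print(f"spread with 10 people:     {small:.4f}")
print(f"spread with 10,000 people: {large:.4f}")
```

The share of "yes" choices fluctuates much less from trial to trial in the larger group, which is the property psychohistory relies on.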

## Key points of Psychohistory
- Simulation of Individual Psychological Activities and Actions:
Psychohistory enables the simulation of individual human psychological processes and behaviors. By modeling the decision-making, emotions, and actions of numerous individuals, it captures the nuanced dynamics of human behavior in various contexts.
- Aggregation of Individual Behaviors into Collective Psychological and Social Dynamics:
From the vast array of individual psychological activities and actions, psychohistory synthesizes and summarizes the collective psychological state and social behaviors of groups. This aggregation allows for the prediction of large-scale societal trends and events based on the interplay of individual actions.

## Why Gemini
- Massive Human Data Training for Simulating Individual Psychological Activities and Actions:
Gemini is trained on vast amounts of human-generated data. This training lets it simulate individual psychological processes and behaviors plausibly, capturing nuances of human decision-making, emotions, and actions.
- Long-Context Mode of 1-2 Million Tokens for Summarizing Collective Psychological Activities and Actions:
Gemini’s long-context mode, which supports 1-2 million tokens, allows it to process large-scale data in a single request. This capability lets Gemini aggregate and summarize the psychological activities and actions of a very large number of individuals, and derive from them the collective psychological states and societal behaviors that psychohistory requires.

## How to implement psychohistory using Gemini
1. To predict future states, use Gemini-1.5-Pro to insert several intermediate time points between the present and the future. While it is difficult to determine distant events, it is easier to make confident predictions about near-term events. (function: predict_future_sequence)
2. To predict future states, use Gemini-1.5-Pro to analyze which human physical and psychological states influence these states. Analyze these two sets of states at each time point. (function: predict_future_sequence)
3. Based on the physical and psychological states of interest, use Gemini-1.5-Pro to design keywords for crawling data from the internet. Perform data crawling based on these keywords. (functions: get_news_keywords_for_topic, get_reddit_subreddits_for_topic)
4. Use Gemini-1.5-Flash to analyze each piece of data for its impact on the states, and use LongContext Gemini to summarize the two sets of attributes (currently representing the attributes of the real world). (function: answer_questions, class: CachedDeriveOverallState)
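
Step 1 can be pictured with a small stand-in. In the project, Gemini-1.5-Pro chooses the intermediate time points; the sketch below simply spaces them evenly, which is an assumption made only for illustration. The helper name `intermediate_time_points` is hypothetical.

```python
from datetime import date, timedelta

def intermediate_time_points(start, end, steps):
    """Return `steps` dates after `start`, evenly spaced and ending at `end`.

    Stand-in only: the project lets the model pick meaningful time points.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if end <= start:
        raise ValueError("end must be later than start")
    total_days = (end - start).days
    return [
        start + timedelta(days=round(total_days * i / steps))
        for i in range(1, steps + 1)
    ]

points = intermediate_time_points(date(2025, 1, 1), date(2030, 1, 1), steps=5)
for point in points:
    print(point.isoformat())
```

Each consecutive pair of dates then defines one short-horizon prediction step, which is easier to reason about than a single jump to the final date.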

Loop: starting from the two sets of attributes for the current real-world state, predict the two sets of attributes for the next time point:
- Based on the current physical and psychological states of the world, use Gemini to simulate individual actions and analyze how these actions impact the next time point. (function: predict_individual_future_actions)
- Aggregate all these impacts into a long text and use Gemini LongContext to summarize them, obtaining the physical attributes for the next time point (analogous to using distance, speed, and time to calculate the next distance). (function: forecast_overall_future_state)
- With the physical attributes of the next time point, use Gemini to simulate news reports and online discussions, effectively simulating psychological activities. (function: simulate_future_individual_behavior)
- Use Gemini-1.5-Flash to analyze each news report or online discussion for its impact on psychological states. (function: answer_questions)
- Aggregate all these impacts into a long text and use Gemini LongContext to summarize them, obtaining the psychological attributes for the next time point. (function: derive_overall_state)
- With both sets of attributes for one time point, move to the next iteration.
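
The shape of this loop can be sketched without any API call. The code below is an illustration under assumptions: the callables in `steps` stand in for the project's functions, and their names and signatures here are hypothetical rather than the project's actual ones.

```python
def run_forecast_loop(time_points, physical, psychological, steps):
    """Advance the two attribute sets through each time point in order.

    `steps` maps hypothetical step names to callables; the real project
    functions may take different arguments.
    """
    history = []
    for t in time_points:
        # 1. Simulate individual actions and their impacts on time point t.
        impacts = steps["predict_actions"](physical, psychological, t)
        # 2. Summarize the impacts into the physical attributes at t.
        physical = steps["forecast_physical"](impacts, t)
        # 3. Simulate news reports and discussions under the new physical state.
        posts = steps["simulate_posts"](physical, t)
        # 4. Analyze the impact of each post on psychological states.
        effects = [steps["analyze_post"](post) for post in posts]
        # 5. Summarize the effects into the psychological attributes at t.
        psychological = steps["derive_psychological"](effects, t)
        history.append((t, physical, psychological))
    return history

# Toy stand-ins so the loop runs offline.
toy_steps = {
    "predict_actions": lambda phys, psy, t: [f"action given {phys} and {psy}"],
    "forecast_physical": lambda impacts, t: f"physical@{t}",
    "simulate_posts": lambda phys, t: [f"post about {phys}"],
    "analyze_post": lambda post: f"effect of {post}",
    "derive_psychological": lambda effects, t: f"psychological@{t}",
}

history = run_forecast_loop(["2026", "2027"], "physical@now", "psychological@now", toy_steps)
for entry in history:
    print(entry)
```

Passing the steps in as callables keeps the control flow separate from the model calls, so the loop can be exercised with stubs before spending any tokens.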

## Why LongContext Gemini is so important for Psychohistory
- As mentioned earlier, a key aspect of psychohistory is the ability to summarize group states from the psychological activities or actions of numerous individuals. For example, if describing one individual’s psychological activity requires 300 tokens of text, a group of 2,000 people would need a context length of 600,000 tokens. The larger the context window, the larger the group that can be supported. The larger the group, the smaller the impact of individual randomness on the group.
- In the CachedDeriveOverallState class, LongContext Gemini must be used to summarize the physical and psychological states of the real world from massive amounts of internet data.
- In the forecast_overall_future_state function, LongContext Gemini must be used to summarize the behaviors of a massive number of individuals and their impacts on a future time point to derive the physical state at that future time point.
- In the derive_overall_state function, LongContext Gemini must be used to summarize the simulated psychological activities of a massive number of individuals to determine the psychological state at a given time point.
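
The token arithmetic in the first point above can be checked directly. This is a minimal sketch: the helper `max_group_size` and its `reserved_tokens` parameter (room left for instructions and the answer) are assumptions for illustration, while the 300-token figure and the 1-2 million token windows come from the text.

```python
def max_group_size(context_tokens, tokens_per_person, reserved_tokens=0):
    """Largest number of individual descriptions that fit in the context window."""
    if tokens_per_person <= 0:
        raise ValueError("tokens_per_person must be positive")
    usable = context_tokens - reserved_tokens
    return max(0, usable // tokens_per_person)

print(2_000 * 300)                      # 600000 tokens for 2,000 people
print(max_group_size(1_000_000, 300))   # 3333
print(max_group_size(2_000_000, 300))   # 6666
```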


## Why caching
Long-context API calls are relatively expensive. When a large shared prompt has to be sent with many repeated requests, Context-Caching mode can be used to reduce the cost.
In this project, Context-Caching mode is implemented in the CachedDeriveOverallState class to cache the large volume of internet text data. The cached data is then reused across requests that summarize the current physical and psychological state of society.
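
The general pattern can be sketched as follows. This is not the project's CachedDeriveOverallState class: the function names, the system instruction, the versioned model name, and the TTL are all assumptions, and the sketch relies on `caching.CachedContent.create` and `GenerativeModel.from_cached_content` from the google-generativeai SDK behaving as described in its documentation. Caching also generally requires a large minimum input size, so a small corpus may be rejected.

```python
from datetime import timedelta

def build_cached_corpus(posts, separator="\n\n"):
    """Join the crawled posts into the single text block that gets cached."""
    return separator.join(p.strip() for p in posts if p and p.strip())

def ask_with_cache(posts, questions,
                   model_name="models/gemini-1.5-flash-001", ttl_minutes=30):
    """Cache the corpus once, then reuse it for every question.

    Assumptions: google-generativeai is installed and genai.configure()
    has already been called with a valid API key.
    """
    # Imported lazily so the pure helper above works without the SDK.
    import google.generativeai as genai
    from google.generativeai import caching

    cache = caching.CachedContent.create(
        model=model_name,
        system_instruction="Summarize the state of society from the posts provided.",
        contents=[build_cached_corpus(posts)],
        ttl=timedelta(minutes=ttl_minutes),
    )
    try:
        model = genai.GenerativeModel.from_cached_content(cached_content=cache)
        # Each request sends only the question; the corpus comes from the cache.
        return [model.generate_content(q).text for q in questions]
    finally:
        cache.delete()  # release the cache once all questions are answered

corpus = build_cached_corpus(["post one", "  ", "post two"])
print(repr(corpus))
```

Only `build_cached_corpus` runs here; `ask_with_cache` is defined but not called, since it needs network access and credentials.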

## Three execution results

## Limitations
- The data collected from the internet is limited to news snippets and Reddit posts, lacking diversity in sources.
- Experiments with a 2M context length have not yet been conducted.
- The project currently supports only questions of the form “Will something happen by year XX?” and not questions of the form “When will something happen?”

In [1]:
import os
import json
import google.generativeai as genai
import praw
from datetime import datetime,timedelta 
from newsapi import NewsApiClient
import concurrent.futures
from typing import Any
import threading
from google.generativeai import caching
from multiprocessing import Pool
from functools import partial
import time
import random

# newsapi is a news-fetching package; install it with `pip install newsapi-python`. https://newsapi.org/
# praw is a package for fetching data from Reddit.

In [2]:
# reddit api setting
client_id = ""
client_secret = ""
user_agent = ""
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)

# newsapi key setting
news_api_key = ""
newsapi = NewsApiClient(api_key=news_api_key)

# gemini key
genai.configure(api_key=os.environ['API_KEY'])
model = genai.GenerativeModel('gemini-1.5-flash')

usage_lock = threading.Lock()

In [3]:
def read_posts_file(path):
    """
    Reads JSON files containing lists of strings (each string is a combined article or post),
    and returns a concatenated list of strings.

    Parameters:
    - path: either a folder path (string) or a list of file paths (list of strings)
    """
    all_data = []
    if isinstance(path, str):  # It's a folder path
        json_files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.json')]
    elif isinstance(path, list):  # It's a list of file paths
        json_files = path
    else:
        raise ValueError("path must be a folder path (string) or a list of file paths (list of strings)")

    for file_path in json_files:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            all_data.extend(data)
    return all_data

def get_token_count(text,model):
    try:
        model = genai.GenerativeModel(model)
        response = model.count_tokens(text)
        return response.total_tokens
    except Exception as e:
        print(f"Error counting tokens: {e}")
        return 0

def log_usage_data(op_name, model_name, prompt_tokens, candidates_tokens, cached_content_tokens=None, duration_seconds=None, usage_log_file='usage_log.json'):
    """
    Logs the usage data into a JSON file.

    Parameters:
    - op_name: Name of the LLM operation.
    - model_name: Name of the model used.
    - prompt_tokens: Number of prompt tokens.
    - candidates_tokens: Number of candidates tokens.
    - cached_content_tokens: Number of cached content tokens (if any).
    - duration_seconds: Duration in seconds (if applicable).
    - usage_log_file: Path to the usage log file.
    """
    usage_data = {
        'op_name': op_name,
        'model_name': model_name,
        'prompt_tokens': prompt_tokens,
        'candidates_tokens': candidates_tokens,
    }
    if cached_content_tokens is not None:
        usage_data['cached_content_tokens'] = cached_content_tokens
    if duration_seconds is not None:
        usage_data['duration_seconds'] = duration_seconds

    with usage_lock:
        if os.path.exists(usage_log_file):
            with open(usage_log_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
        else:
            data = []

        data.append(usage_data)

        with open(usage_log_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

def select_posts_by_token_limit(json_path, target_tokens, model="gemini-pro"):
    """
    Select posts from a JSON file to limit total tokens to a specified number.

    Parameters:
    - json_path (str): Path to the JSON file containing a list of posts.
    - target_tokens (int): The target token limit.
    - model (str): The name of the model to use for token counting (default: "gemini-pro").

    Returns:
    - list: Selected posts within the token limit.
    """
    # Read the JSON file
    with open(json_path, 'r', encoding='utf-8') as f:
        posts = json.load(f)

    # Concatenate posts and calculate initial token count
    concatenated_posts = "\n".join(posts)
    total_tokens = get_token_count(concatenated_posts, model)
    total_posts = len(posts)

    # print(f"Initial total tokens: {total_tokens}, Total posts: {total_posts}")

    # If tokens exceed the target, randomly sample posts
    if total_tokens > target_tokens * 0.95:
        # Calculate the fraction of posts to sample
        sample_fraction = (target_tokens * 0.90) / total_tokens
        sample_count = max(1, int(sample_fraction * total_posts))

        # print(f"Sampling {sample_count} posts out of {total_posts}")

        # Randomly sample posts
        sampled_posts = random.sample(posts, sample_count)

        # Verify the new token count
        sampled_concatenated_posts = "\n".join(sampled_posts)
        new_token_count = get_token_count(sampled_concatenated_posts, model)

        # print(f"New total tokens after sampling: {new_token_count}")

        # Return the sample either way; the check only matters for the optional warning.
        # if new_token_count > target_tokens * 0.95:
        #     print(f"Warning: Token count still exceeds {target_tokens * 0.95}. Returning sampled posts.")
        return sampled_posts
    else:
        # If tokens are already within the limit, return all posts
        # print(f"Token count is within limit. Returning all posts.")
        return posts

def get_subreddit_posts(keyword, print_data=False, sort_by='hot', limit=10, max_dialogues_per_comment=5, save_to_file=False, save_path=None):
    """
    Fetches posts from a specified subreddit and optionally saves them to a file.

    Parameters:
    - keyword (str): The name of the subreddit to fetch posts from.
    - print_data (bool, optional): If True, prints the fetched data to the console. Default is False.
    - sort_by (str, optional): Sorting method for posts; either 'hot' or 'new'. Default is 'hot'.
    - limit (int, optional): The maximum number of posts to fetch. Default is 10.
    - max_dialogues_per_comment (int, optional): The maximum number of dialogues to collect per comment thread. Default is 5.
    - save_to_file (bool, optional): If True, saves the fetched data to a file. Default is False.
    - save_path (str, optional): The directory path where the data should be saved. If None, defaults to "collect_data".

    Returns:
    - List[dict]: A list of dictionaries containing information about each fetched post.
    """
    subreddit = reddit.subreddit(keyword)
    
    # Select sorting method based on sort_by parameter
    if sort_by == 'hot':
        posts_iter = subreddit.hot(limit=limit)
    elif sort_by == 'new':
        posts_iter = subreddit.new(limit=limit)
    else:
        raise ValueError("Invalid sort_by value. Use 'hot' or 'new'.")
    
    posts = []
    for post in posts_iter:
        comments = []

        def collect_comments(comment_list, current_depth=0, max_depth=5, dialogues_collected=0, max_dialogues=10):
            result = []
            if current_depth >= max_depth or dialogues_collected >= max_dialogues:
                return result, dialogues_collected

            for comment in comment_list:
                if isinstance(comment, praw.models.MoreComments):
                    continue

                comment_info = {
                    "Comment": comment.body.replace('\n', ' '),
                    "Replies": []
                }
                dialogues_collected += 1

                if dialogues_collected >= max_dialogues:
                    result.append(comment_info)
                    break

                replies, dialogues_collected = collect_comments(
                    comment.replies,
                    current_depth=current_depth+1,
                    max_depth=max_depth,
                    dialogues_collected=dialogues_collected,
                    max_dialogues=max_dialogues
                )
                comment_info["Replies"] = replies
                result.append(comment_info)

                if dialogues_collected >= max_dialogues:
                    break

            return result, dialogues_collected

        post_comments, _ = collect_comments(
            post.comments,
            max_depth=5,
            max_dialogues=max_dialogues_per_comment
        )

        post_info = {
            "Subreddit": post.subreddit.display_name,
            "Title": post.title.replace('\n', ' '),
            "Content": post.selftext.replace('\n', ' '),
            "Upvotes (Score)": post.score,
            "Author": str(post.author),
            "Number of Comments": post.num_comments,
            "Created Date": datetime.utcfromtimestamp(post.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
            "Post URL": post.url,
            "CommentList": post_comments
        }
        posts.append(post_info)
        
        if print_data:
            title_clean = post.title.replace('\n', ' ')
            content_clean = post.selftext.replace('\n', ' ')
            subreddit_name = post.subreddit.display_name
            
            print(f"Subreddit: {subreddit_name}")
            print(f"Post Title: {title_clean}")
            print(f"Post Content: {content_clean}")
            print(f"Upvotes (Score): {post.score}")
            print(f"Post URL: {post.url}")
            print(f"Author: {post.author}")
            print(f"Number of Comments: {post.num_comments}")
            created_time = datetime.utcfromtimestamp(post.created_utc).strftime('%Y-%m-%d %H:%M:%S')
            print(f"Created Date: {created_time}")
            print("Comments:")
            
            def print_comments(comments_list, indent=0):
                for c in comments_list:
                    comment_clean = c['Comment'].replace('\n', ' ')
                    print(" " * indent + f"- {comment_clean}")
                    if c['Replies']:
                        print_comments(c['Replies'], indent + 4)

            print_comments(post_comments)
            print("-" * 50)
    
    if save_to_file:
        if save_path is None:
            save_path = "collect_data"
        os.makedirs(save_path, exist_ok=True)

        date_str = datetime.now().strftime('%Y%m%d')
        filename = f"{save_path}/reddit_{keyword}_{sort_by}_{limit}_{date_str}.json"

        # Create a list to hold the combined post strings
        posts_strings = []

        for post in posts:
            post_string = (
                f"Subreddit: {post['Subreddit']}\n"
                f"Title: {post['Title']}\n"
                f"Content: {post['Content']}\n"
                f"Upvotes (Score): {post['Upvotes (Score)']}\n"
                f"Author: {post['Author']}\n"
                f"Number of Comments: {post['Number of Comments']}\n"
                f"Created Date: {post['Created Date']}\n"
                f"Post URL: {post['Post URL']}\n"
                f"Comments:\n"
            )

            def format_comments(comments_list, indent=0):
                comment_strings = []
                for c in comments_list:
                    comment_str = " " * indent + f"- {c['Comment']}"
                    comment_strings.append(comment_str)
                    if c['Replies']:
                        comment_strings.extend(format_comments(c['Replies'], indent + 4))
                return comment_strings

            comments_strings = format_comments(post['CommentList'])
            post_string += "\n".join(comments_strings)

            # Append the combined post string to the list
            posts_strings.append(post_string)

        # Save the list of post strings to a JSON file
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(posts_strings, f, ensure_ascii=False, indent=4)
    
    return posts


def get_news_articles(keywords, from_date=None, to_date=None, total_pages=1, print_data=False, page_size=20, save_to_file=False, save_path=None):
    """
    Fetches news articles based on provided keywords and optional date range, and optionally saves them to files.

    Parameters:
    - keywords (List[str]): A list of keywords to search for in news articles.
    - from_date (str, optional): The start date for fetching articles in 'YYYY-MM-DD' format. If None, no start date filter is applied.
    - to_date (str, optional): The end date for fetching articles in 'YYYY-MM-DD' format. If None, no end date filter is applied.
    - total_pages (int, optional): The number of pages to fetch for each keyword. Default is 1.
    - print_data (bool, optional): If True, prints the fetched articles to the console. Default is False.
    - page_size (int, optional): The number of articles per page. Default is 20.
    - save_to_file (bool, optional): If True, saves the fetched articles to files. Default is False.
    - save_path (str, optional): The directory path where the articles should be saved. If None, defaults to "collect_data".

    Returns:
    - List[dict]: A list of dictionaries containing information about each fetched article.
    """
    all_articles = []
    for keyword in keywords:
        # print(f"Fetching articles for keyword: {keyword}")
        articles_for_keyword = []  # Initialize list for current keyword

        for page in range(1, total_pages + 1):
            try:
                # Fetch articles with pagination and date filters
                articles = newsapi.get_everything(
                    q=keyword,
                    language='en',
                    page_size=page_size,
                    page=page,
                    from_param=from_date,
                    to=to_date
                )

                # Add exception handling based on the response status
                if articles['status'] != 'ok':
                    # print(f"Error fetching articles for keyword '{keyword}': {articles.get('message', 'Unknown error')}")
                    continue

                # Process articles as before
                for article in articles['articles']:
                    article_info = {
                        "Source": article['source']['name'],
                        "Author": article['author'],
                        "Title": article['title'],
                        "Description": article['description'],
                        "Content": article['content'],
                        "Published At": article['publishedAt'],
                        "URL": article['url'],
                    }
                    articles_for_keyword.append(article_info)
                    all_articles.append(article_info)

                    if print_data:
                        print(f"Source: {article_info['Source']}")
                        print(f"Author: {article_info['Author']}")
                        print(f"Title: {article_info['Title']}")
                        print(f"Description: {article_info['Description']}")
                        print(f"Content: {article_info['Content']}")
                        print(f"Published At: {article_info['Published At']}")
                        print(f"URL: {article_info['URL']}")
                        print("-" * 50)

            except Exception as e:
                print(f"Exception occurred while fetching articles for keyword '{keyword}', page {page}: {e}")

        # Save articles for the current keyword to a separate file
        if save_to_file:
            if save_path is None:
                save_path = "collect_data"
            os.makedirs(save_path, exist_ok=True)

            date_str = datetime.now().strftime('%Y%m%d')
            filename = f"{save_path}/news_{keyword}_{page_size}_{date_str}.json"

            # Create a list to hold the combined article strings
            articles_strings = []

            for article in articles_for_keyword:
                article_string = (
                    f"Source: {article['Source']}\n"
                    f"Author: {article['Author']}\n"
                    f"Title: {article['Title']}\n"
                    f"Description: {article['Description']}\n"
                    f"Content: {article['Content']}\n"
                    f"Published At: {article['Published At']}\n"
                )
                articles_strings.append(article_string)

            # Save the list of article strings to a JSON file
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(articles_strings, f, ensure_ascii=False, indent=4)

    return all_articles

In [4]:
def get_news_keywords_for_topic(topic, questions=None):
    """
    Generates a list of broad and general keywords for news article searches related to a given topic.

    Parameters:
    - topic (str): The main topic to generate keywords for.
    - questions (List[str], optional): Additional questions to refine the keyword generation. Default is None.

    Returns:
    - List[str]: A list of broad and general keywords suitable for searching news articles.
    """
    if questions is None:
        questions = []

    questions_text = "\n".join(questions)

    model = genai.GenerativeModel("gemini-1.5-pro-latest")

    # First LLM Request: Analysis and Summarization
    prompt1 = f"""
Given the following topics:
{questions_text}

The above topics require you to search for related information. Please carefully analyze what keywords can be used in search engines to obtain broad, general information related to these topics.

1. **Avoid overly specific terms.** Instead of exact phrases (e.g., "US renewable energy capacity 2023 EIA"), use broader terms (e.g., "US renewable energy").
2. **Focus on thematic keywords**: Extract high-level thematic fields (e.g., "US economy," "US environment," "US technology").
3. **Identify broad keyword combinations**: Consider general keywords that capture the overall intent of the questions without limiting the scope unnecessarily.

Finally, generate a list of **broad and general keyword combinations** suitable for searching across major news databases or online platforms.
"""

    # Generate response for first request
    response1 = model.generate_content(prompt1)
    usage1 = response1.usage_metadata
    input_tokens1 = usage1.prompt_token_count

    # Log token usage for first request
    # print(f"Request: get_news_keywords_for_topic (first call), Tokens used - Input: {input_tokens1}, Output: {usage1.candidates_token_count}, Total: {usage1.total_token_count}")

    # Extract the analysis text
    analysis_text = response1.text  # public accessor instead of the private _result attribute

    # Second LLM Request: Extract Keywords as JSON
    prompt2 = f"""
{prompt1}
{analysis_text}

Based on the refined analysis, adjust the keywords to make them broader and more general. Avoid overly specific terms, and ensure each keyword reflects high-level thematic categories (e.g., "US economy," "climate change," "global technology"). 

Return a list of relevant, broad keywords as a JSON array of strings.
"""

    # Define the expected JSON schema
    response_schema = {
        "type": "array",
        "items": {"type": "string"}
    }

    # Generate response for second request
    response2 = model.generate_content(
        prompt2,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=response_schema
        ),
    )
    usage2 = response2.usage_metadata
    input_tokens2 = usage2.prompt_token_count

    # Log token usage for second request
    # print(f"Request: get_news_keywords_for_topic (second call), Tokens used - Input: {input_tokens2}, Output: {usage2.candidates_token_count}, Total: {usage2.total_token_count}")

    # Parse the response content
    text = response2.text  # public accessor instead of the private _result attribute
    keywords = json.loads(text)

    # Save prompts and responses
    prompttemp_dir = os.path.join("prompttemp")
    os.makedirs(prompttemp_dir, exist_ok=True)

    with open(os.path.join(prompttemp_dir, 'get_news_keywords_for_topic_prompt1.txt'), 'w', encoding='utf-8') as f:
        f.write("=== Prompt 1 ===\n")
        f.write(prompt1)
        f.write("\n\n=== Response 1 ===\n")
        f.write(analysis_text)

    with open(os.path.join(prompttemp_dir, 'get_news_keywords_for_topic_prompt2.txt'), 'w', encoding='utf-8') as f:
        f.write("=== Prompt 2 ===\n")
        f.write(prompt2)
        f.write("\n\n=== Response 2 ===\n")
        f.write(json.dumps(keywords, indent=4))

    return keywords

def get_reddit_subreddits_for_topic(topic, questions=None):
    """
    Generates a list of relevant Reddit subreddit names based on a given topic.

    Parameters:
    - topic (str): The main topic to find subreddits for.
    - questions (List[str], optional): Additional questions to refine the subreddit search. Default is None.

    Returns:
    - List[str]: A list of subreddit names related to the topic.
    """
    if questions is None:
        questions = []
    
    questions_text = "\n".join(questions)

    model = genai.GenerativeModel("gemini-1.5-pro-latest")

    # First LLM Request: Analysis and Summarization
    prompt1 = f"""
Given the following topics:
{questions_text}

You are searching for relevant user discussions on Reddit about the topics mentioned above. Please carefully analyze the fields involved in these topics and brainstorm related Reddit subreddit names.
	1.	First, write down your analysis to determine the fields or themes related to these topics.
	2.	Brainstorm which subreddits are likely to host discussions on these fields or themes, and list these subreddits.
	3.	Select the best 20 subreddits from this list.
"""

    # Generate response for first request
    response1 = model.generate_content(prompt1)

    # Log token usage for first request
    usage1 = response1.usage_metadata
    input_tokens1 = usage1.prompt_token_count
    # print(f"Request: get_reddit_subreddits_for_topic (first call), Tokens used - Input: {input_tokens1}, Output: {usage1.candidates_token_count}, Total: {usage1.total_token_count}")

    # Extract the analysis text
    analysis_text = response1.text  # public accessor instead of the private _result attribute

    # Second LLM Request: Extract Subreddit Names as JSON
    prompt2 = f"""
{prompt1}
{analysis_text}

Please refine these last subreddit names. Ensure that each subreddit is relevant to the topic and questions, and consider any contextual details derived from the analysis.

Return a list of subreddit names as a JSON array of strings.
"""

    # Define the expected JSON schema
    response_schema = {
        "type": "array",
        "items": {"type": "string"}
    }

    # Generate response for second request
    response2 = model.generate_content(
        prompt2,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=response_schema
        ),
    )

    # Log token usage for second request
    usage2 = response2.usage_metadata
    input_tokens2 = usage2.prompt_token_count
    # print(f"Request: get_reddit_subreddits_for_topic (second call), Tokens used - Input: {input_tokens2}, Output: {usage2.candidates_token_count}, Total: {usage2.total_token_count}")

    # Parse the response content
    text = response2.text  # public accessor instead of the private _result attribute
    subreddits = json.loads(text)

    # Save prompts and responses
    prompttemp_dir = os.path.join("prompttemp")
    os.makedirs(prompttemp_dir, exist_ok=True)

    with open(os.path.join(prompttemp_dir, 'get_reddit_subreddits_for_topic_prompt1.txt'), 'w', encoding='utf-8') as f:
        f.write("=== Prompt 1 ===\n")
        f.write(prompt1)
        f.write("\n\n=== Response 1 ===\n")
        f.write(analysis_text)

    with open(os.path.join(prompttemp_dir, 'get_reddit_subreddits_for_topic_prompt2.txt'), 'w', encoding='utf-8') as f:
        f.write("=== Prompt 2 ===\n")
        f.write(prompt2)
        f.write("\n\n=== Response 2 ===\n")
        f.write(json.dumps(subreddits, indent=4))

    return subreddits

def filter_post_by_questions(post_string, questions):
    """
    Determines whether the post_string is related to the list of questions.

    Parameters:
    - post_string: The post content string.
    - questions: The list of questions.

    Returns:
    - bool: Indicates whether it is related.
    """

    questions_text = "\n".join(questions)

    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = f"""
Given the following post:
————————————————————
{post_string}
————————————————————
And the following questions:
————————————————————
{questions_text}
————————————————————
Determine whether the post is related to any of the questions above. Return your answer as a JSON object with a single field "is_related", which is a boolean value.

Ensure that you only output the JSON object and nothing else.
"""

    input_tokens = model.count_tokens(prompt)

    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema={
                "type": "object",
                "properties": {
                    "is_related": {"type": "boolean"}
                },
                "required": ["is_related"]
            }
        ),
    )

    usage = response.usage_metadata
    # print(f"Request: filter_post_by_questions, Tokens used - Input: {input_tokens}, Output: {usage.candidates_token_count}, Total: {usage.total_token_count}")

    text = response._result.candidates[0].content.parts[0].text
    result = json.loads(text)
    is_related = result.get("is_related", False)

    prompttemp_dir = os.path.join("prompttemp")
    os.makedirs(prompttemp_dir, exist_ok=True)

    with open(os.path.join(prompttemp_dir, 'filter_post_by_questions_prompt.txt'), 'w', encoding='utf-8') as f:
        f.write("=== Prompt ===\n")
        f.write(prompt)
        f.write("\n\n=== Response ===\n")
        f.write(text)

    return is_related
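

# --- Illustrative sketch (not part of the original pipeline) ----------------
# filter_post_by_questions() above classifies one post per call. A batch
# wrapper along the following lines could apply it to a list of posts.
# ASSUMPTIONS: the name `filter_related_posts` and the `is_related_fn`
# parameter are hypothetical and introduced here only for illustration. The
# predicate is injectable so the helper can be exercised without calling the
# Gemini API.
def filter_related_posts(posts, questions, is_related_fn=None):
    """Return the posts for which the relevance predicate returns True."""
    if is_related_fn is None:
        # Resolved at call time, so the default is the function defined above.
        is_related_fn = filter_post_by_questions
    related = []
    for post in posts:
        try:
            if is_related_fn(post, questions):
                related.append(post)
        except Exception as e:
            # One failed classification should not abort the whole batch.
            print(f"Skipping post after error: {e}")
    return related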

def predict_future_sequence(user_request):
    """
    Generates a sequence of future time points and key attributes based on the concept of psychohistory.

    This function uses an LLM to analyze the user's request and divides the future time span into significant
    time points. It also identifies important social and psychological attributes that will influence the event
    described in the user request.

    Parameters:
    - user_request (str): The event or topic for which to predict the future sequence.

    Returns:
    - time_points (List[str]): A list of future time points as strings.
    - physical_questions (List[str]): A list of social attributes to consider.
    - psychological_questions (List[str]): A list of group psychological attributes to consider.
    """
    # Capture the current date
    current_date = datetime.now().strftime("%Y-%m-%d")

    # Instruction text shared by the first and third prompts
    instruction = f'''
**You are tasked with using "psychohistory" to predict the future of humanity. Psychohistory, a concept introduced by science fiction author Isaac Asimov, predicts future human events by analyzing a large number of individual behaviors and psychological activities.**
Predicting events far in the future is highly challenging, so you will insert a series of critical time points within the long time span from the present to the target event. You will progressively predict the states at these time points step by step until reaching the final time point.
To predict an event, you need to understand certain attributes of the real world. Based on the provided event, analyze step by step the attributes you need to know.
---
### **Process for Dividing Time Points:**
1. **Analysis of the Time Span**
   Write an analysis of the entire period from the current date to the target future date. Based on existing patterns in human behavior, identify major events, policy changes, or natural progression intervals. Then, list a series of more than 20 time points (these time points should not include the starting or target time).
2. **Select Key Time Points**
   From the above candidate time points, select 5 relatively evenly distributed and significant time points.
3. **Add Current and Target Dates**
   Based on the selected time points, add the present date and the target future date to form a comprehensive timeline.
---
### **Process for Determining Key Attributes:**
1. **Analysis of Key Attributes**
   Write a detailed analysis based on the event to be predicted. Analyze which social attributes and group psychological attributes will affect the event.
   - Select **20 social attributes**, focusing on measurable aspects of society (e.g., infrastructure, climate, health metrics).
   - Select **20 group psychological attributes**, focusing on psychological activities of groups, such as attitudes, perceptions, and opinions about various aspects of life or events.
2. **Select Important Attributes**
   Choose the 20 most important attributes from each list, ensuring both social and group psychological attributes are included.
3. **List the Attributes**
   - **Social Attributes**: Provide a list of the selected social attributes.
   - **Group Psychological Attributes**: Provide a list of the selected group psychological attributes.
---

### **Your Task:**

You are now at {current_date}, going to predict the following event:

**{user_request}**
!!!!If the user’s prediction includes regional limitations (e.g., US, Hong Kong), add these regional limitations to every required attribute (all attributes, including psychological attributes).
Please follow the above processes for dividing time points and determining key attributes.
'''

    # Build the prompt for the first API call
    prompt1 = f"""{instruction}"""

    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    
    try:
        # First API call
        response1 = model.generate_content(prompt1)
        usage1 = response1.usage_metadata
        input_tokens1 = usage1.prompt_token_count
        # print(f"Request: future_time_sequence_planner (first call), Tokens used - Input: {input_tokens1}, Output: {usage1.candidates_token_count}, Total: {usage1.total_token_count}")
        
        # Extract the generated text
        generated_text = response1._result.candidates[0].content.parts[0].text

        # Build prompt2 for refining the analysis
        prompt2 = f"""Prompt: {prompt1}

Response: {generated_text}
_________________________________
Please analyze the above Response, confirm if the social and psychological attributes are correctly classified, and refine them accordingly. Ensure:
1. Social attributes focus on measurable, tangible aspects of society (e.g., infrastructure, environment, health metrics).
2. Psychological attributes focus on group psychological activities, such as perceptions, attitudes, and opinions.
Do not append qualifiers such as "(measured through surveys)" to psychological attributes; readers already know how these attributes are obtained.
After refining, ensure each list contains exactly 20 attributes."""

        # Second API call with prompt2
        response2 = model.generate_content(prompt2)
        usage2 = response2.usage_metadata
        input_tokens2 = usage2.prompt_token_count
        # print(f"Request: future_time_sequence_planner (second call), Tokens used - Input: {input_tokens2}, Output: {usage2.candidates_token_count}, Total: {usage2.total_token_count}")

        # Extract the refined analysis
        refined_text = response2._result.candidates[0].content.parts[0].text

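        # Build prompt3: extract structured fields from the refined analysis.
        # Embed the generated text of the first response, not the response object.
        prompt3 = f"""{instruction}
___________________________
{generated_text}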
{refined_text}
_____________________
Please extract the time points and the two sets of attributes from the last refined content, and output them as a JSON object with three fields:
- "time_points": a list of strings representing the time points.
- "physical_questions": a list of exactly 20 strings, where each string is an individual social attribute.
- "psychological_questions": a list of exactly 20 strings, where each string is an individual group psychological attribute.

Ensure that the time points and attributes are correctly extracted.
- **Use simple present tense** without emphasizing any specific time points.
- **Use straightforward language**.
"""

        # Define the valid JSON Schema for the expected response
        response_schema = {
            "type": "object",
            "properties": {
                "time_points": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "physical_questions": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "psychological_questions": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["time_points", "physical_questions", "psychological_questions"]
        }

        # Third API call, requesting structured output with the correct schema
        response3 = model.generate_content(
            prompt3,
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json",
                response_schema=response_schema
            ),
        )
        usage3 = response3.usage_metadata
        input_tokens3 = usage3.prompt_token_count
        # print(f"Request: future_time_sequence_planner (third call), Tokens used - Input: {input_tokens3}, Output: {usage3.candidates_token_count}, Total: {usage3.total_token_count}")

        # Extract the time points and questions from response3
        data = json.loads(response3._result.candidates[0].content.parts[0].text)

        # Extract time points and questions
        time_points = data.get('time_points', [])
        physical_questions = data.get('physical_questions', [])[:20]
        psychological_questions = data.get('psychological_questions', [])[:20]

        # Note: despite their names, 'physical_questions' and 'psychological_questions' hold attribute names, not questions.

        # Create the 'prompttemp' directory if it doesn't exist
        prompttemp_dir = os.path.join("prompttemp")
        os.makedirs(prompttemp_dir, exist_ok=True)

        # Save prompt1 and response1
        with open(os.path.join(prompttemp_dir, 'future_time_sequence_planner_prompt1.txt'), 'w', encoding='utf-8') as f:
            f.write("=== Prompt 1 ===\n")
            f.write(prompt1)
            f.write("\n\n=== Response 1 ===\n")
            f.write(response1._result.candidates[0].content.parts[0].text)

        # Save prompt2 and response2
        with open(os.path.join(prompttemp_dir, 'future_time_sequence_planner_prompt2.txt'), 'w', encoding='utf-8') as f:
            f.write("=== Prompt 2 ===\n")
            f.write(prompt2)
            f.write("\n\n=== Response 2 ===\n")
            f.write(response2._result.candidates[0].content.parts[0].text)

        # Save prompt3 and response3
        with open(os.path.join(prompttemp_dir, 'future_time_sequence_planner_prompt3.txt'), 'w', encoding='utf-8') as f:
            f.write("=== Prompt 3 ===\n")
            f.write(prompt3)
            f.write("\n\n=== Response 3 ===\n")
            f.write(response3._result.candidates[0].content.parts[0].text)

        return time_points, physical_questions, psychological_questions

    except Exception as e:
        print(f"An error occurred: {e}")
        return [], [], []
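

# --- Illustrative sketch (not part of the original pipeline) ----------------
# predict_future_sequence() asks the model for exactly 20 attributes per list,
# but the JSON schema above only constrains types, not list lengths, and the
# function returns empty lists on error. A caller may want to check the result
# before continuing. ASSUMPTIONS: the name `check_sequence_plan` and its
# parameters are hypothetical; the thresholds are defaults chosen for this
# sketch, not values confirmed by the original code.
def check_sequence_plan(time_points, physical_questions, psychological_questions,
                        expected_attributes=20, min_time_points=2):
    """Return a list of human-readable problems; an empty list means the plan looks usable."""
    problems = []
    if len(time_points) < min_time_points:
        problems.append(
            f"expected at least {min_time_points} time points, got {len(time_points)}"
        )
    for name, items in (("physical_questions", physical_questions),
                        ("psychological_questions", psychological_questions)):
        if len(items) != expected_attributes:
            problems.append(
                f"{name}: expected {expected_attributes} items, got {len(items)}"
            )
        if any(not isinstance(x, str) or not x.strip() for x in items):
            problems.append(f"{name}: contains empty or non-string entries")
    return problems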
    

def answer_questions(post_string, questions, include_questions_in_output=True, usage_log_file='usage_log.json'):
    """
    Analyzes a given post to determine its impact on a list of attributes.

    This function processes the post content using an LLM to extract insights related to the specified questions.
    It generates an analysis that highlights how the post influences each attribute.

    Parameters:
    - post_string (str): The content of the post to analyze.
    - questions (List[str]): A list of attribute questions to address.
    - include_questions_in_output (bool, optional): Whether to include the questions in the output. Defaults to True.
    - usage_log_file (str, optional): Path to the usage log file. Defaults to 'usage_log.json'.

    Returns:
    - result_post (str): The analysis result as a string.
    """
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Combine all questions into a single string
    questions_text = "\n".join([f"{i+1}. {q}" for i, q in enumerate(questions)])

    # Updated prompt to meet the new requirements
    prompt = f"""
You are an analyst tasked with analyzing and summarizing the current attributes of human society.
You have received the following piece of information. Analyze how it impacts any of the listed attributes and explain the specific impact.
    •   If the information is a news article, focus on whether it provides quantitative analysis for any attributes and include those details.
    •   If it comes from sources like Reddit or similar discussions, determine if there are qualitative trends and describe the observed trends.
    •   Summarize the information in one sentence first, then provide your analysis directly without reviewing the article.
    •   Avoid including any disclaimers or additional notes. Write with confidence, as others will judge the accuracy of your analysis.
Use attribute names to indicate which attribute you are referring to, and avoid using any numbering. Provide concise, valuable insights, and do not mention attributes you cannot analyze.
——————————————————————
Post Content:
{post_string}
——————————————————————
Questions:
{questions_text}
"""

    try:
        # print("Making API request to 'answer_questions'...")

        # Generate response
        response = model.generate_content(prompt)
        usage = response.usage_metadata
        input_tokens = usage.prompt_token_count
        # print(f"API call to 'answer_questions' completed. Tokens used - Input: {input_tokens}, Output: {usage.candidates_token_count}, Total: {usage.total_token_count}")
        answer = response._result.candidates[0].content.parts[0].text.strip()

        # Save the prompt and response to 'prompttemp/answer_questions_prompts.txt'
        prompttemp_dir = os.path.join("prompttemp")
        os.makedirs(prompttemp_dir, exist_ok=True)
        prompt_file = os.path.join(prompttemp_dir, 'answer_questions_prompts.txt')

        with open(prompt_file, 'a', encoding='utf-8') as f:
            f.write("=== Prompt ===\n")
            f.write(prompt)
            f.write("\n\n=== Response ===\n")
            f.write(answer)
            f.write("\n\n")

        # Log usage data (non-caching request)
        log_usage_data(
            op_name='extract_individual_contributions',
            model_name=model.model_name,
            prompt_tokens=input_tokens,
            candidates_tokens=usage.candidates_token_count,
            usage_log_file=usage_log_file
        )

    except Exception as e:
        print(f"An error occurred: {e}")
        answer = 'Error generating answer'

    # Note: include_questions_in_output currently has no effect; the analysis
    # text is returned unchanged in both cases.
    result_post = answer

    return result_post


class CachedDeriveOverallState:
    """
    A class that leverages cached content to generate summaries of the overall societal state.

    This class initializes a cache containing a large prompt (`content_string`) to reduce redundant API calls
    when generating summaries based on different question sets. It uses Google's Generative AI models with caching
    capabilities to efficiently produce summaries.

    Parameters:
    - content_string (str): The long text containing numerous pieces of collected information about society.
    - model_name (str, optional): The name of the model to use. Defaults to 'models/gemini-1.5-flash-001'.
    - usage_log_file (str, optional): Path to the usage log file. Defaults to 'usage_log.json'.

    Methods:
    - generate_summary(questions, usage_log_file='usage_log.json'): Generates a summary focusing on the provided social attributes.
    - delete_cache(): Deletes the cached content to free up resources.
    """
    def __init__(self, content_string, model_name='models/gemini-1.5-flash-001', usage_log_file='usage_log.json'):

        self.content_string = content_string

        # Prepare the prompt up to the point to be cached
        cached_prompt = f"""
You are a sociologist, and below is a long text, followed by a series of social attributes you have been recently studying.
The long text contains numerous pieces of collected information about society, each ending with a brief analysis that highlights observations related to the social attributes you are studying.
Based on the analysis of all the information in the long text, write a summary about the current state of society, focusing on the social attributes listed below. The summary should describe the overall state of society.
If you cannot summarize the quantifiable physical attributes you are studying from the long text, then provide reasonable values based on your real-world experience.
————————————————————————————————————————————————————
Long Text:
{self.content_string}
————————————————————————————————————————————————————
"""

        # Create the cached content from the fixed part of the prompt.
        # The model_name argument selects the model instead of a hard-coded value.
        cache_ttl = timedelta(minutes=5)
        self.cached_content = caching.CachedContent.create(
            model=model_name,
            display_name=model_name,  # used to identify the cache
            contents=[cached_prompt],
            ttl=cache_ttl,
        )
        # Create the model using the cached content
        self.model = genai.GenerativeModel.from_cached_content(cached_content=self.cached_content)

        # Token count of the cached prompt
        cache_prompt_tokens = self.cached_content.usage_metadata.total_token_count

        # Log caching storage usage; the logged duration matches the cache TTL
        log_usage_data(
            op_name='derive_overall_state_cache_storage',
            model_name=model_name,
            prompt_tokens=0,
            candidates_tokens=0,
            cached_content_tokens=cache_prompt_tokens,
            duration_seconds=int(cache_ttl.total_seconds()),
            usage_log_file=usage_log_file
        )

    def generate_summary(self, questions, usage_log_file='usage_log.json'):
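        # Build the prompt with the variable part. The join is done outside the
        # f-string because a backslash inside an f-string expression is a
        # syntax error before Python 3.12.
        questions_text = "\n".join(questions)
        prompt = f"""
The Social Attributes You Are Studying:
{questions_text}
————————————————————————————————————————————————————
Summary of Current State:"""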

        # Generate the response
        response = self.model.generate_content(prompt)
        summary = response._result.candidates[0].content.parts[0].text.strip()
        
        # After generating the response
        usage = response.usage_metadata

        # Log usage data
        log_usage_data(
            op_name='derive_overall_state_from_cache',
            model_name=self.model.model_name,
            prompt_tokens=usage.prompt_token_count,
            candidates_tokens=usage.candidates_token_count,
            cached_content_tokens=usage.cached_content_token_count,
            usage_log_file=usage_log_file
        )
        print(f"[longcontext] Tokens used - Prompt: {usage.prompt_token_count}, Candidates: {usage.candidates_token_count}, Cached Content: {usage.cached_content_token_count}")
        return summary

    def delete_cache(self):
        # Delete the cached content
        self.cached_content.delete()
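

# --- Illustrative sketch (not part of the original pipeline) ----------------
# CachedDeriveOverallState keeps a server-side cache alive until delete_cache()
# is called or the TTL expires. Wrapping it in a context manager is one way to
# make sure the cache is released even if a summary call raises.
# ASSUMPTIONS: the name `cached_overall_state` and the `state_factory`
# parameter are hypothetical; the factory is injectable so the wrapper can be
# tested without creating a real cache.
from contextlib import contextmanager


@contextmanager
def cached_overall_state(content_string, state_factory=None, **kwargs):
    """Yield a cached-state object and always delete its cache on exit."""
    if state_factory is None:
        # Resolved at call time, so the default is the class defined above.
        state_factory = CachedDeriveOverallState
    state = state_factory(content_string, **kwargs)
    try:
        yield state
    finally:
        state.delete_cache()


# Possible usage (commented out because it would call the API):
# with cached_overall_state(content_string) as state:
#     physical_summary = state.generate_summary(physical_questions)
#     psychological_summary = state.generate_summary(psychological_questions)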

def derive_overall_state(content_string, questions, usage_log_file='usage_log.json'):
    """
    Generates a summary of the overall societal state based on provided content and attributes.

    This function accepts a long string containing various news articles, Reddit posts, and other
    pieces of information about society. It uses a language model to analyze this content and
    produce a concise summary focusing on the specified societal attributes.

    Parameters:
    - content_string (str): The long text containing the collected societal information to be summarized.
    - questions (list of str): A list of societal attributes (questions) to focus on in the summary.
    - usage_log_file (str, optional): The path to the usage log file for logging API usage.
      Defaults to 'usage_log.json'.

    Returns:
    - summary (str or None): A string containing the summary of the current societal state,
      focusing on the provided attributes. The prompt asks the model to start the summary
      with "State: ". Returns None if the API call fails.
    """

    # Combine all questions into a single string
    questions_text = "\n".join([f"{i+1}. {q}" for i, q in enumerate(questions)])

    # Construct the prompt
    prompt = f"""
You are a sociologist. Below is a long text containing numerous pieces of information about society, each accompanied by a brief analysis. These analyses correspond to certain attributes listed below:

————————————————————————————————————————————————————
The societal attributes you are studying:
{questions_text}
————————————————————————————————————————————————————
Long text:
{content_string}
————————————————————————————————————————————————————

Based on the analyses of all the provided information, write a summary of the current state of society, focusing on the societal attributes you are studying.

The summary must confidently describe the current societal condition, avoiding any speculative language.

Please note:
    1. This summary is intended for readers to understand the current state of society. Therefore, do not include your analyses in the summary. It must solely describe societal conditions.
    2. Do not add any additional “Note:” at the end. Only provide the description of the societal state you are studying.
    3. If you cannot summarize the quantifiable physical attributes you are studying from the long text, then provide reasonable values based on your real-world experience.
    4. Start your summary with “State: ”.

State: 
"""

    try:
        # Initialize the model
        model = genai.GenerativeModel("gemini-1.5-flash-001")

        # Generate the summary
        response = model.generate_content(prompt)
        summary = response._result.candidates[0].content.parts[0].text.strip()

        # Print token usage
        usage = response.usage_metadata
        # print(f"API call to 'derive_overall_state' completed. Tokens used - Input: {usage.prompt_token_count}, Output: {usage.candidates_token_count}, Total: {usage.total_token_count}")

        # Log usage data
        log_usage_data(
            op_name='derive_overall_state',
            model_name=model.model_name,
            prompt_tokens=usage.prompt_token_count,
            candidates_tokens=usage.candidates_token_count,
            usage_log_file=usage_log_file
        )
        print(f"[longcontext] Tokens used - Prompt: {usage.prompt_token_count}, "
              f"Candidates: {usage.candidates_token_count}")
        return summary

    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def predict_individual_future_actions(
    physical_state,
    current_time_point,
    post,
    psychological_state,
    future_time_point,
    physical_attributes,
    usage_log_file='usage_log.json'
):
    """
    Simulates an individual's future actions and their impact on observed attributes.

    Parameters:
    - physical_state: String representing the current physical overall state.
    - current_time_point: String representing the current time point.
    - post: String representing the individual post/information.
    - psychological_state: String representing the current psychological state.
    - future_time_point: String representing the future time point.
    - physical_attributes: List of physical attributes we are observing.
    - usage_log_file: String representing the file path where usage logs will be saved.

    Returns:
    - generated_text (str or None): Free text containing the individual's simulated actions
      and impact analysis, or None if the API call fails.
    """
    model = genai.GenerativeModel('gemini-1.5-flash')

    # Construct the prompt
    prompt = f"""
You are an ordinary member of society, currently at the time point {current_time_point}, and you have just come across the following piece of information. Based on the societal state and people’s psychological condition, simulate a reasonable reaction and the actions you would take.
Then analyze how your actions would influence the societal attributes you are researching at the future time point {future_time_point}.

Societal attributes you are researching:
{', '.join(physical_attributes)}

The information you just came across:
{post}

The societal state at this time:
{physical_state}

The psychological state of people at this time:
{psychological_state}

Please first write down your possible reactions and actions based on people’s psychological state at this moment.
Then describe how your actions would affect the societal attributes you are researching at the future time point.

Important Notes:
    1.  Individual impact is minimal and well understood. If you believe your actions have no impact on a specific attribute, do not mention that attribute at all.
    2.  If any attributes are influenced, no matter how small, describe the trend of the impact.
    3.  Stay fully immersed in this role. Do not break character. Every word should reflect this role.
    4.  Use concise language. Only focus on key points.
"""

    try:
        # Generate the response with a high temperature (1.5) to increase the randomness of simulated reactions
        response = model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=1.5
            )
        )
        generated_text = response._result.candidates[0].content.parts[0].text.strip()

        # Get usage data
        usage = response.usage_metadata
        input_tokens = usage.prompt_token_count

        # Log usage data (non-caching request)
        log_usage_data(
            op_name='predict_individual_future_actions',
            model_name=model.model_name,
            prompt_tokens=input_tokens,
            candidates_tokens=usage.candidates_token_count,
            usage_log_file=usage_log_file
        )

        # Ensure prompttemp directory exists
        prompttemp_dir = os.path.join("prompttemp")
        os.makedirs(prompttemp_dir, exist_ok=True)

        # Append prompt and response to the same text file
        with open(os.path.join(prompttemp_dir, 'predict_individual_future_actions_prompts.txt'), 'a', encoding='utf-8') as f:
            f.write("=== Prompt ===\n")
            f.write(prompt)
            f.write("\n\n=== Response ===\n")
            f.write(generated_text)
            f.write("\n\n")

        # Return the generated text as a string
        return generated_text

    except Exception as e:
        print(f"An error occurred: {e}")
        return None
    
def forecast_overall_future_state(content_string, physical_attributes, future_time_point, current_time_point, usage_log_file='usage_log.json'):
    """
    Predicts the future overall state based on individual future actions.

    Parameters:
    - content_string: A long string containing individual future actions.
    - physical_attributes: List of future social attributes.
    - future_time_point: The future time point.
    - current_time_point: The current time point.
    - usage_log_file: Path to the file where usage logs will be saved.

    Returns:
    - A string containing the predicted future overall state.
    """

    # Construct the prompt as per the new requirements
    prompt = f"""
You are a sociologist currently situated at {current_time_point}. You have gathered a list of information.
Each piece of information details a human activity and its impact on the future time point {future_time_point}.
—————————————————————————
The information you have collected:
{content_string}
—————————————————————————
You are particularly focused on the following social attributes at the future time point {future_time_point}:
{', '.join(physical_attributes)}
—————————————————————————
Based on the information you have collected, predict the status of the social attributes you are focusing on at the future time point.

Please note:
    1. Use a confident and assertive tone to describe the future state.
    2. Write the final results, do not include the analytical process in your response.
    3. If the basis for predicting a certain physical attribute is insufficient, please still provide your predicted value based on your experience, as this is very important.
    4. Your prediction should be a concrete, quantified numerical attribute rather than a trend of change in the attribute.
"""

    try:
        model = genai.GenerativeModel("gemini-1.5-flash")
        response = model.generate_content(prompt)
        usage = response.usage_metadata
        input_tokens = usage.prompt_token_count
        # print(f"API call to 'forecast_overall_future_state' completed. Tokens used - Input: {input_tokens}, Output: {usage.candidates_token_count}, Total: {usage.total_token_count}")
        prediction = response._result.candidates[0].content.parts[0].text.strip()

        # Ensure prompttemp directory exists
        prompttemp_dir = os.path.join("prompttemp")
        os.makedirs(prompttemp_dir, exist_ok=True)

        # Save prompt and response to the same file
        with open(os.path.join(prompttemp_dir, 'forecast_overall_future_state_prompts.txt'), 'a', encoding='utf-8') as f:
            f.write("=== Prompt ===\n")
            f.write(prompt)
            f.write("\n\n=== Response ===\n")
            f.write(prediction)
            f.write("\n\n")
        
        # Log usage data
        log_usage_data(
            op_name='forecast_overall_future_state',
            model_name=model.model_name,
            prompt_tokens=usage.prompt_token_count,
            candidates_tokens=usage.candidates_token_count,
            usage_log_file=usage_log_file
        )
        print(f"[longcontext] Tokens used - Prompt: {usage.prompt_token_count}, "
              f"Candidates: {usage.candidates_token_count}")

    except Exception as e:
        print(f"An error occurred: {e}")
        prediction = 'Error generating prediction'

    return prediction
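

# --- Illustrative sketch (not part of the original pipeline) ----------------
# Most functions in this module repeat the same steps to record a prompt and
# its response under 'prompttemp'. A shared helper along these lines could
# replace those blocks. ASSUMPTIONS: the name `save_prompt_response` and its
# parameters are hypothetical; the '<name>_prompts.txt' file naming mirrors the
# append-mode files written above but is a choice made for this sketch.
import os


def save_prompt_response(name, prompt, response_text, mode='a', base_dir='prompttemp'):
    """Write one prompt/response pair to <base_dir>/<name>_prompts.txt and return the path."""
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, f"{name}_prompts.txt")
    with open(path, mode, encoding='utf-8') as f:
        f.write("=== Prompt ===\n")
        f.write(prompt)
        f.write("\n\n=== Response ===\n")
        f.write(response_text)
        f.write("\n\n")
    return path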


def simulate_future_individual_behavior(physical_state, current_time_point, post_string, usage_log_file='usage_log.json'):
    """
    Simulates what the author of a given post/news might say or write in the current world state.

    Parameters:
    - physical_state: String describing the current state of the world.
    - current_time_point: String representing the current time point.
    - post_string: String containing the original post or news article.
    - usage_log_file: String representing the path to the usage log file.

    Returns:
    - A string containing the simulated behavior of the same author in the current time.
    """
    prompt = f"""
You are a journalist and a frequent Reddit user. The current time is {current_time_point}.

In the past, you posted the following news article or Reddit post. Please write a news article from the same institution or a Reddit post that you would participate in, based on your personality and values and the current state of the world.
Note:
1) If you posted a news article in the past, write a news article now. If you posted a Reddit post in the past, write a Reddit post now.
The current state of the world is:
{physical_state}.
_______________________
News article or Reddit post you posted in the past:
{post_string}
_______________________
The news article or Reddit post you are writing now:
"""

    try:
        model = genai.GenerativeModel("gemini-1.5-flash")
        response = model.generate_content(prompt)
        usage = response.usage_metadata
        input_tokens = usage.prompt_token_count
        # print(f"API call to 'simulate_future_individual_behavior' completed. Tokens used - Input: {input_tokens}, Output: {usage.candidates_token_count}, Total: {usage.total_token_count}")
        simulation = response._result.candidates[0].content.parts[0].text.strip()

        # Ensure prompttemp directory exists
        prompttemp_dir = os.path.join("prompttemp")
        os.makedirs(prompttemp_dir, exist_ok=True)

        # Save prompt and response
        with open(os.path.join(prompttemp_dir, 'simulate_future_individual_behavior_prompts.txt'), 'a', encoding='utf-8') as f:
            f.write("=== Prompt ===\n")
            f.write(prompt)
            f.write("\n\n=== Response ===\n")
            f.write(simulation)
            f.write("\n\n")

        # Log usage data
        log_usage_data(
            op_name='simulate_future_individual_behavior',
            model_name=model.model_name,
            prompt_tokens=input_tokens,
            candidates_tokens=usage.candidates_token_count,
            usage_log_file=usage_log_file
        )
        
    except Exception as e:
        print(f"An error occurred: {e}")
        simulation = 'Error generating simulation'

    return simulation

def summarize_prediction(question, world_state, current_time, usage_log_file='usage_log.json'):
    """
    Answers a question based on the current world state and time.

    Parameters:
    - question: The question to be answered.
    - world_state: The current world state description.
    - current_time: The current time point.
    - usage_log_file: The file path to save the usage log.

    Returns:
    - answer: The answer to the question.
    """

    # Build the prompt
    prompt = f"""
You are a sociologist in the year {current_time}. Imagine that you are sitting in a long-abandoned library, untouched for decades. 
Among the dust-covered books and old records, you discover a question written on a piece of paper left behind by someone from the past.

Based on the current world state, please answer the question confidently and decisively, without any hesitation. Imagine the person who wrote the question hoped that someone like you would answer it one day. Please ensure your answer is clear, concise, and reflects the present state of the world.

World State:
{world_state}

Question found in the abandoned library:
{question}

Your Answer:
"""

    try:
        model = genai.GenerativeModel("gemini-1.5-pro")
        response = model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=0.7
            )
        )
        usage = response.usage_metadata
        input_tokens = usage.prompt_token_count
        print(f"API call to 'summarize_prediction' completed. Tokens used - Input: {input_tokens}, Output: {usage.candidates_token_count}, Total: {usage.total_token_count}")
        # Use the public accessor instead of the private `_result` attribute
        answer = response.text.strip()

        # Ensure 'prompttemp' directory exists
        prompttemp_dir = os.path.join("prompttemp")
        os.makedirs(prompttemp_dir, exist_ok=True)

        # Save prompt and response
        with open(os.path.join(prompttemp_dir, 'summarize_prediction_prompts.txt'), 'a', encoding='utf-8') as f:
            f.write("=== Prompt ===\n")
            f.write(prompt)
            f.write("\n\n=== Response ===\n")
            f.write(answer)
            f.write("\n\n")

        # Log usage data (non-caching request)
        log_usage_data(
            op_name='summarize_prediction',
            model_name=model.model_name,
            prompt_tokens=input_tokens,
            candidates_tokens=usage.candidates_token_count,
            usage_log_file=usage_log_file
        )

    except Exception as e:
        print(f"An error occurred: {e}")
        answer = 'Error generating answer'

    return answer
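
Both `simulate_future_individual_behavior` and `summarize_prediction` repeat the same block that appends a prompt/response pair to a file under `prompttemp`, and the simulation function is later called from up to 40 worker threads. The sketch below shows one way this logging could be factored into a single helper that serialises writes with a lock. `save_prompt_and_response` and `_PROMPT_LOG_LOCK` are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
import os
import threading

# Hypothetical helper, not part of the original notebook.
_PROMPT_LOG_LOCK = threading.Lock()


def save_prompt_and_response(op_name, prompt, response_text, log_dir="prompttemp"):
    """Append one prompt/response pair to <log_dir>/<op_name>_prompts.txt."""
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"{op_name}_prompts.txt")
    entry = (
        "=== Prompt ===\n"
        f"{prompt}"
        "\n\n=== Response ===\n"
        f"{response_text}"
        "\n\n"
    )
    # Hold the lock so entries written from different worker threads
    # are appended one at a time rather than interleaved.
    with _PROMPT_LOG_LOCK:
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(entry)
    return log_path
```

Under this assumption, each function would call `save_prompt_and_response('summarize_prediction', prompt, answer)` (or the equivalent for the simulation) in place of the inline `with open(...)` block.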


In [5]:

def process_and_save_individual_contributions(input_paths, questions, output_file, usage_log_file='usage_log.json'):
    """
    Reads posts from input paths, processes each post by calling the 'answer_questions' function in parallel,
    and writes the processed contributions to the output file.

    Parameters:
    - input_paths: List of file paths or directories to read posts from.
    - questions: List of questions to process each post.
    - output_file: Path to the output file where processed contributions will be saved.
    - usage_log_file: Path to the log file where usage data will be saved.
    """
    posts = read_posts_file(input_paths)
    posts = posts[:2000]  # Cap the number of posts processed per run
    total_posts = len(posts)  # Count after truncation so progress messages match

    def process_post(idx_post):
        idx, post = idx_post
        # print(f"Processing post {idx}/{total_posts} in Step 1")
        if isinstance(post, dict):
            post_string = "\n".join(f"{key}: {value}" for key, value in post.items())
        else:
            post_string = str(post)
        result_post = answer_questions(post_string, questions, usage_log_file=usage_log_file)
        return result_post

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        processed_posts = list(executor.map(process_post, enumerate(posts, start=1)))

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(processed_posts, f, ensure_ascii=False, indent=4)

    # print(f"Processed contributions saved to {output_file}")

def process_overall_state(
    input_data,
    questions,
    output_file=None,
    usage_log_file='usage_log.json'
):
    """
    Processes the overall state by summarizing individual contributions and generating a summary.

    This function reads input data (which can be a file path or data), selects posts based on a token limit,
    concatenates them into a single string, and then uses the `derive_overall_state` function to generate
    a summary of the current state focusing on the provided questions.

    Parameters:
    - input_data: Path to the input file or the actual data to be processed.
    - questions: List of strings representing the questions or attributes to focus on in the summary.
    - output_file (str, optional): If provided, the summary will be saved to this file path.
    - usage_log_file (str, optional): Path to the usage log file for logging API usage. Defaults to 'usage_log.json'.

    Returns:
    - summary (str): A string containing the summary of the current state.
    """
    target_token_limit = 1000000
    selected_posts = select_posts_by_token_limit(input_data, target_token_limit, model="gemini-1.5-flash")
    content_string = "\n\n".join(selected_posts)
    summary = derive_overall_state(content_string, questions, usage_log_file=usage_log_file)

    if output_file:
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(summary, f, ensure_ascii=False, indent=4)

    return summary

def summarize_current_state(contributions_text, physical_questions, psychological_questions, physical_output_file, psychological_output_file, usage_log_file='usage_log.json'):
    """
    Summarizes the current physical and psychological state based on individual contributions.

    This function takes the processed individual contributions, creates a cached state for efficient processing,
    and generates summaries for both physical and psychological attributes by invoking `derive_overall_state`.
    The summaries are saved to the specified output files.

    Parameters:
    - contributions_text: List of strings containing the processed individual contributions.
    - physical_questions: List of strings representing the physical attributes to focus on.
    - psychological_questions: List of strings representing the psychological attributes to focus on.
    - physical_output_file (str): Path to the file where the physical state summary will be saved.
    - psychological_output_file (str): Path to the file where the psychological state summary will be saved.
    - usage_log_file (str, optional): Path to the usage log file for logging API usage. Defaults to 'usage_log.json'.

    Returns:
    - None
    """
    contributions_text = '\n\n\n\n'.join(contributions_text)
    cached_state = CachedDeriveOverallState(contributions_text, usage_log_file=usage_log_file)

    # print("Starting derive_overall_state with caching for physical_questions...")
    physical_state_summary = cached_state.generate_summary(physical_questions, usage_log_file=usage_log_file)
    with open(physical_output_file, 'w', encoding='utf-8') as f:
        json.dump(physical_state_summary, f, ensure_ascii=False, indent=4)
    # print(f"Physical overall state saved to {physical_output_file}")

    # print("Starting derive_overall_state with caching for psychological_questions...")
    psychological_state_summary = cached_state.generate_summary(psychological_questions, usage_log_file=usage_log_file)
    with open(psychological_output_file, 'w', encoding='utf-8') as f:
        json.dump(psychological_state_summary, f, ensure_ascii=False, indent=4)
    # print(f"Psychological overall state saved to {psychological_output_file}")

    cached_state.delete_cache()


def process_and_save_individual_future_actions(
    physical_state_file,
    psychological_state_file,
    current_time_point,
    posts_file,
    future_time_point,
    physical_attributes,
    output_file,
    usage_log_file='usage_log.json'
):
    """
    Reads inputs from files, processes individual future actions in parallel, and writes the output to a file.

    Parameters:
    - physical_state_file: Path to the physical overall state JSON file.
    - psychological_state_file: Path to the psychological overall state JSON file.
    - current_time_point: The current time point as a string.
    - posts_file: Path to the JSON file containing posts.
    - future_time_point: The future time point as a string.
    - physical_attributes: List of physical attributes we are observing.
    - output_file: Path to the output JSON file.
    - usage_log_file: Path to the usage log file.
    """

    # Read physical state
    with open(physical_state_file, 'r', encoding='utf-8') as f:
        physical_state = json.load(f)

    # Read psychological state
    with open(psychological_state_file, 'r', encoding='utf-8') as f:
        psychological_state = json.load(f)

    # Read posts
    with open(posts_file, 'r', encoding='utf-8') as f:
        posts = json.load(f)
    posts = posts[:2000]
    total_posts = len(posts)

    def process_post(idx_post):
        idx, post = idx_post
        # print(f"Processing post {idx}/{total_posts} in Step 3")
        if isinstance(post, dict):
            post_string = "\n".join(f"{key}: {value}" for key, value in post.items())
        else:
            post_string = str(post)
        result_post = predict_individual_future_actions(
            physical_state,
            current_time_point,
            post_string,
            psychological_state,
            future_time_point,
            physical_attributes,
            usage_log_file=usage_log_file
        )
        return result_post

    with concurrent.futures.ThreadPoolExecutor(max_workers=40) as executor:
        processed_posts = list(executor.map(process_post, enumerate(posts, start=1)))

    # Save the results
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(processed_posts, f, ensure_ascii=False, indent=4)

    # print(f"Processed future actions saved to {output_file}")

def process_forecast_overall_future_state_with_io(
    input_file,
    physical_attributes,
    future_time_point,
    current_time_point,
    output_file,
    usage_log_file='usage_log.json'
):
    """
    Reads individual future actions from an input file, processes them, and writes the future overall state to the output file.

    Parameters:
    - input_file: Path to the input file containing individual future actions.
    - physical_attributes: List of physical attributes to focus on.
    - future_time_point: The future time point.
    - current_time_point: The current time point.
    - output_file: Path to the output file where the future overall state will be saved.
    - usage_log_file: Path to the file where usage logs will be saved.
    """

    # Use the token selection utility to filter posts
    target_token_limit = 1000000  # Target token limit of 1 million tokens
    selected_posts = select_posts_by_token_limit(input_file, target_token_limit, model="gemini-1.5-flash")

    # Concatenate selected posts into a single string with separators
    content_string = '\n\n'.join(selected_posts)

    # Call forecast_overall_future_state
    # print("Starting forecast_overall_future_state...")
    start_time = time.time()
    future_state = forecast_overall_future_state(
        content_string,
        physical_attributes,
        future_time_point,
        current_time_point,
        usage_log_file=usage_log_file
    )
    end_time = time.time()
    elapsed_time = end_time - start_time
    # print(f"forecast_overall_future_state execution time: {elapsed_time:.2f} seconds")

    # Write the future state to the output file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(future_state, f, ensure_ascii=False, indent=4)

def process_and_save_simulated_future_behavior(
    physical_state_file,
    current_time_point,
    input_paths,
    output_file,
    usage_log_file='usage_log.json'
):
    """
    Reads inputs from files, processes future individual behaviors in parallel, and writes the output to a file.

    Parameters:
    - physical_state_file: Path to the physical overall state JSON file.
    - current_time_point: The current time point as a string.
    - input_paths: List of file paths or directories to read posts from.
    - output_file: Path to the output JSON file.
    - usage_log_file: Path to the usage log file.
    """

    # Read physical state
    with open(physical_state_file, 'r', encoding='utf-8') as f:
        physical_state = json.load(f)

    # Read posts
    posts = read_posts_file(input_paths)
    posts = posts[:2000]  # Cap the number of posts processed per run
    total_posts = len(posts)  # Count after truncation so progress messages match

    def process_post(idx_post):
        idx, post = idx_post
        # print(f"Processing post {idx}/{total_posts} in Step 5")
        if isinstance(post, dict):
            post_string = "\n".join(f"{key}: {value}" for key, value in post.items())
        else:
            post_string = str(post)
        result_post = simulate_future_individual_behavior(
            physical_state,
            current_time_point,
            post_string,
            usage_log_file=usage_log_file
        )
        return result_post

    with concurrent.futures.ThreadPoolExecutor(max_workers=40) as executor:
        processed_posts = list(executor.map(process_post, enumerate(posts, start=1)))

    # Save the results
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(processed_posts, f, ensure_ascii=False, indent=4)

    # print(f"Simulated future posts saved to {output_file}")

def process_and_save_summarized_prediction(
    question,
    world_state_file,
    current_time_point,
    output_file,
    usage_log_file='usage_log.json'
):
    """
    Reads the world state from a file, processes the summary, and writes the answer to an output file.

    Parameters:
    - question: The user's question.
    - world_state_file: Path to the JSON file containing the world state.
    - current_time_point: The current time point as a string.
    - output_file: Path to the output file where the summary will be saved.
    - usage_log_file: Path to the usage log file.
    """

    # Read world state
    with open(world_state_file, 'r', encoding='utf-8') as f:
        world_state = json.load(f)

    # Process the summary
    answer = summarize_prediction(
        question=question,
        world_state=world_state,
        current_time=current_time_point,
        usage_log_file=usage_log_file
    )

    # Save the answer to the output file
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(answer)

    print(f"Summary answer saved to {output_file}")

def process_filter_posts(post_strings, questions, num_processes=4):
    """
    Filters a list of post strings, retaining only those posts related to the provided questions.

    Parameters:
    - post_strings (list of str): A list containing multiple post content strings.
    - questions (list of str): A list of questions used to determine the relevance of each post.
    - num_processes (int, optional): The number of parallel processes to use. Defaults to 4.

    Returns:
    - list of str: A filtered list of post strings that are relevant to the provided questions.

    Functionality:
    This function utilizes multiprocessing to filter the given list of post strings, retaining only those posts that are related to the provided list of questions.
    
    Steps:
    1. Use the `partial` function to bind the `filter_post_by_questions` function with the fixed `questions` parameter.
    2. Create a multiprocessing pool with a size of `num_processes`.
    3. Apply the `filter_post_by_questions` function to each post string to determine its relevance.
    4. Based on the results, filter and retain only the relevant posts.
    """
    
    partial_is_post_related = partial(filter_post_by_questions, questions=questions)

    with Pool(num_processes) as pool:
        results = pool.map(partial_is_post_related, post_strings)

    filtered_posts = [post for post, is_related in zip(post_strings, results) if is_related]

    num_filtered = len(filtered_posts)
    num_original = len(post_strings)
    # print(f"Filtered {num_filtered} posts out of {num_original} original posts.")

    return filtered_posts
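
The three `process_and_save_*` functions in this cell share the same inner pattern: cap the posts at 2000, render each post as text (`key: value` lines for dictionaries, `str()` otherwise), and fan the work out with `ThreadPoolExecutor.map`. A minimal sketch of that shared pattern is given below. `post_to_string` and `map_posts_in_parallel` are hypothetical names used only for illustration, and `worker` stands in for whichever per-post function (for example `answer_questions`) is being applied.

```python
import concurrent.futures


def post_to_string(post):
    """Render a post as text: 'key: value' lines for dicts, str() otherwise."""
    if isinstance(post, dict):
        return "\n".join(f"{key}: {value}" for key, value in post.items())
    return str(post)


def map_posts_in_parallel(posts, worker, max_workers=20, limit=2000):
    """Apply worker(post_string) to at most `limit` posts, keeping input order."""
    post_strings = [post_to_string(post) for post in posts[:limit]]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of its inputs,
        # regardless of which thread finishes first.
        return list(executor.map(worker, post_strings))


if __name__ == "__main__":
    demo_posts = [{"title": "Launch", "score": 10}, "plain text post"]
    print(map_posts_in_parallel(demo_posts, str.upper, max_workers=2))
```

Because `executor.map` preserves input order, the saved JSON list lines up index-for-index with the input posts, which is what lets later steps treat the output of one step as the "same individuals" in the next.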

In [6]:
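def psychohistory(data_dir, topic, crawl_data=True, filter_data=True, use_existing_sequence=False, reddit_limit=25, reddit_max_dialogues_per_comment=3, news_page_size=100):
    """
    Runs the full pipeline for one topic: predicts a sequence of time points and questions,
    optionally crawls and filters Reddit and news data, derives the initial physical and
    psychological states, steps the simulation through each time point, and answers the topic question.

    Parameters:
    - data_dir: Directory where all intermediate and final files are written.
    - topic: The question to answer; also used to pick subreddits and news keywords.
    - crawl_data: If True, download Reddit posts and news articles into data_dir/crawled_data.
    - filter_data: If True, filter the combined posts by relevance to the generated questions.
    - use_existing_sequence: If True, load time points and questions from future_questions.json instead of regenerating them.
    - reddit_limit: Passed to get_subreddit_posts as `limit`.
    - reddit_max_dialogues_per_comment: Passed to get_subreddit_posts as `max_dialogues_per_comment`.
    - news_page_size: Passed to get_news_articles as `page_size`.
    """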

    os.makedirs(data_dir, exist_ok=True)

    print("\n=== Step 1: Predicting future sequence ===")
    future_questions_file = os.path.join(data_dir, 'future_questions.json')

    if use_existing_sequence:
        # print("\n=== Loading existing future sequence ===")
        if os.path.exists(future_questions_file):
            with open(future_questions_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            time_points = data.get('time_points', [])
            physical_questions = data.get('physical_questions', [])
            psychological_questions = data.get('psychological_questions', [])
            combined_questions = data.get('combined_questions', [])
            # print(f"Loaded future questions from {future_questions_file}")
        else:
            print(f"Error: {future_questions_file} does not exist. Cannot load existing future sequence.")
            return
    else:
        time_points, physical_questions, psychological_questions = predict_future_sequence(topic)
        if not time_points:
            print("Error: Failed to generate time points. Exiting the program.")
            return
        combined_questions = physical_questions + psychological_questions
        

        # Build the question data once, then persist it for later runs
        data = {
            'time_points': time_points,
            'physical_questions': physical_questions,
            'psychological_questions': psychological_questions,
            'combined_questions': combined_questions
        }
        with open(future_questions_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        # print(f"Generated future questions and saved to {future_questions_file}")
    print("Time points:", time_points)

    if crawl_data:
        print("\n=== Step 2: Getting subreddits and news keywords ===")
        subreddits = get_reddit_subreddits_for_topic(topic, combined_questions)
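        # Strip a leading "r/" or "/r/" prefix. str.lstrip('r/') would instead remove every
        # leading 'r' or '/' character (e.g. 'robotics' -> 'obotics'). Requires Python 3.9+.
        subreddits = [subreddit.strip().lower().removeprefix('/').removeprefix('r/') for subreddit in subreddits]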

        news_keywords = get_news_keywords_for_topic(topic, combined_questions)

        print("\n=== Step 3: Crawling data ===")
        crawled_data_dir = os.path.join(data_dir, 'crawled_data')
        os.makedirs(crawled_data_dir, exist_ok=True)

        for subreddit in subreddits:
            for sort_by in ['hot', 'new']:
                try:
                    get_subreddit_posts(
                        keyword=subreddit,
                        print_data=False,
                        sort_by=sort_by,
                        limit=reddit_limit,
                        max_dialogues_per_comment=reddit_max_dialogues_per_comment,
                        save_to_file=True,
                        save_path=crawled_data_dir
                    )
                except Exception as e:
                    print(f"Failed to download data from subreddit '{subreddit}' sorted by '{sort_by}'. Error: {e}")

        try:
            get_news_articles(
                keywords=news_keywords,
                from_date=None,
                to_date=None,
                total_pages=1,
                print_data=False,
                page_size=news_page_size,
                save_to_file=True,
                save_path=crawled_data_dir
            )
        except Exception as e:
            print(f"Failed to download news articles. Error: {e}")

    print("\n=== Step 4: Combining and selecting data ===")
    crawled_data_dir = os.path.join(data_dir, 'crawled_data')
    data_files = [os.path.join(crawled_data_dir, f) for f in os.listdir(crawled_data_dir) if f.endswith('.json')]
    
    combined_posts_file = os.path.join(data_dir, 'combined_posts.json')
    all_posts = read_posts_file(data_files)

    with open(combined_posts_file, 'w', encoding='utf-8') as f:
        json.dump(all_posts, f, ensure_ascii=False, indent=4)
    print(f"Combined posts saved to {combined_posts_file}")
    
    if filter_data:
        print("\n=== Step 5: Filtering posts ===")
        filtered_posts = process_filter_posts(all_posts, combined_questions)

        filtered_posts_file = os.path.join(data_dir, 'filtered_posts.json')
        with open(filtered_posts_file, 'w', encoding='utf-8') as f:
            json.dump(filtered_posts, f, ensure_ascii=False, indent=4)

        # print(f"Filtered posts saved to {filtered_posts_file}")
    else:
        print("\n=== Step 5: Loading combined posts from file ===")
        if os.path.exists(combined_posts_file):
            with open(combined_posts_file, 'r', encoding='utf-8') as f:
                all_posts = json.load(f)
            # print(f"Loaded combined posts from {combined_posts_file}")
        else:
            print(f"Error: {combined_posts_file} does not exist. Please run with crawl_data=True first.")
            return


    print("\n=== Step 6: Extracting initial individual contributions ===")
    processed_contributions_file = os.path.join(data_dir, 'processed_contributions.json')
    process_and_save_individual_contributions(
        input_paths=[combined_posts_file],
        questions=combined_questions,
        output_file=processed_contributions_file
    )
    # print(f"Processed contributions saved to {processed_contributions_file}")

    print("\n=== Step 7: Deriving initial overall state ===")
    physical_questions = data.get('physical_questions', [])
    psychological_questions = data.get('psychological_questions', [])

    time_point_0_dir = os.path.join(data_dir, "time_point_0")
    os.makedirs(time_point_0_dir, exist_ok=True)

    physical_state_file = os.path.join(time_point_0_dir, 'physical_overall_state.json')
    psychological_state_file = os.path.join(time_point_0_dir, 'psychological_overall_state.json')

    with open(processed_contributions_file, 'r', encoding='utf-8') as f:
        contributions_text = json.load(f)
    summarize_current_state(
        contributions_text=contributions_text,
        physical_questions=physical_questions,
        psychological_questions=psychological_questions,
        physical_output_file=physical_state_file,
        psychological_output_file=psychological_state_file,
        usage_log_file='usage_log.json'
    )
    
    # print(f"Physical overall state saved to {physical_state_file}")
    # print(f"Psychological overall state saved to {psychological_state_file}")

    input_posts_file = combined_posts_file  # Initial posts file
    current_time_point = time_points[0]
    num_time_points = len(time_points)

    for i in range(1, num_time_points):
        future_time_point = time_points[i]

        print(f"\n=== Processing Time Point {i}/{num_time_points - 1} ({future_time_point}) ===")

        time_point_dir = os.path.join(data_dir, f"time_point_{i}")
        os.makedirs(time_point_dir, exist_ok=True)

        print("\n--- Step 1: Predicting individual future actions ---")
        future_actions_file = os.path.join(time_point_dir, 'future_actions_posts.json')
        process_and_save_individual_future_actions(
            physical_state_file=physical_state_file,
            psychological_state_file=psychological_state_file,
            current_time_point=current_time_point,
            posts_file=input_posts_file,
            future_time_point=future_time_point,
            physical_attributes=physical_questions,
            output_file=future_actions_file
        )

        print("\n--- Step 2: Forecasting overall future physical state ---")
        future_physical_state_file = os.path.join(time_point_dir, f'future_overall_physical_state_{future_time_point}.json')
        process_forecast_overall_future_state_with_io(
            input_file=future_actions_file,
            physical_attributes=physical_questions,
            future_time_point=future_time_point,
            current_time_point=current_time_point,
            output_file=future_physical_state_file
        )
        # print(f"Future physical state saved to {future_physical_state_file}")

        print("\n--- Step 3: Simulating future individual behavior ---")
        simulated_future_posts_file = os.path.join(time_point_dir, f'simulated_future_posts_{future_time_point}.json')
        process_and_save_simulated_future_behavior(
            physical_state_file=future_physical_state_file,
            current_time_point=future_time_point,
            input_paths=[input_posts_file],
            output_file=simulated_future_posts_file,
            usage_log_file='usage_log.json'
        )
        # print(f"Simulated future posts saved to {simulated_future_posts_file}")

        print("\n--- Step 4: Deriving psychological overall state ---")
        psychological_state_file = os.path.join(time_point_dir, f'psychological_overall_state_{future_time_point}.json')
        process_overall_state(
            input_data=simulated_future_posts_file,
            questions=psychological_questions,
            output_file=psychological_state_file
        )
        # print(f"Psychological overall state saved to {psychological_state_file}")

        # Update for next iteration
        input_posts_file = simulated_future_posts_file
        physical_state_file = future_physical_state_file
        current_time_point = future_time_point

    print("\n=== Final Step: Answering the user's question ===")
    question = topic
    final_physical_state_file = physical_state_file  # Last updated physical state file
    answer_output_file = os.path.join(data_dir, 'summary_answer.txt')

    process_and_save_summarized_prediction(
        question=question,
        world_state_file=final_physical_state_file,
        current_time_point=current_time_point,
        output_file=answer_output_file,
        usage_log_file='usage_log.json'
    )

    print("\nAnswer:")
    with open(answer_output_file, 'r', encoding='utf-8') as f:
        answer = f.read()
    print(answer)
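
Subreddit names suggested by the model may arrive as `r/name`, `/r/name`, or a bare `name`, and the crawler needs the bare form. The sketch below shows prefix-based normalisation; `normalize_subreddit` is a hypothetical helper introduced for illustration. It also demonstrates why `str.lstrip` is not a safe substitute: `lstrip` removes any leading characters from the given set rather than a literal prefix.

```python
def normalize_subreddit(name):
    """Return a bare, lower-case subreddit name without a leading 'r/' or '/r/'."""
    name = name.strip().lower()
    for prefix in ("/r/", "r/"):
        if name.startswith(prefix):
            # Remove exactly one literal prefix, then stop.
            name = name[len(prefix):]
            break
    return name


if __name__ == "__main__":
    for raw in ["r/robotics", " /r/SpaceX ", "robotics"]:
        print(repr(raw), "->", normalize_subreddit(raw))
    # lstrip treats 'r/' as a set of characters, so a bare name loses its first letter:
    print("robotics".lstrip("r/"))  # prints "obotics"
```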


In [7]:

# Define a list of questions and their corresponding data directories
question_directory_map = {
    "What level of development will artificial intelligence reach in the United States by 2050?": "tmp/AI/",
    "What level of advancement will SpaceX’s space technology reach by 2050?": "tmp/spacex/",
    "What will the political polarization between the two parties in the United States develop into by 2050?": "tmp/politics/",
    # Add more questions and directories as needed
}

In [13]:
selection = 1 # What level of development will artificial intelligence reach in the United States by 2050?
selected_question = list(question_directory_map.keys())[selection - 1]
data_dir = question_directory_map[selected_question]
topic = selected_question
psychohistory(data_dir, topic, crawl_data=True, filter_data=False, use_existing_sequence=False)


=== Step 1: Predicting future sequence ===
Time points: ['2024', '2028', '2033', '2038', '2043', '2048', '2050']

=== Step 2: Getting subreddits and news keywords ===

=== Step 3: Crawling data ===
Failed to download data from subreddit 'artificialintelligence' sorted by 'hot'. Error: received 404 HTTP response
Failed to download data from subreddit 'artificialintelligence' sorted by 'new'. Error: received 404 HTTP response





=== Step 4: Combining and selecting data ===
Combined posts saved to tmp/AI/combined_posts.json

=== Step 5: Loading combined posts from file ===

=== Step 6: Extracting initial individual contributions ===

=== Step 7: Deriving initial overall state ===
[longcontext] Tokens used - Prompt: 151773, Candidates: 1443, Cached Content: 151510
[longcontext] Tokens used - Prompt: 151852, Candidates: 804, Cached Content: 151510

=== Processing Time Point 1/6 (2028) ===

--- Step 1: Predicting individual future actions ---

--- Step 2: Forecasting overall future physical state ---
[longcontext] Tokens used - Prompt: 920496, Candidates: 392

--- Step 3: Simulating future individual behavior ---

--- Step 4: Deriving psychological overall state ---
[longcontext] Tokens used - Prompt: 903916, Candidates: 874

=== Processing Time Point 2/6 (2033) ===

--- Step 1: Predicting individual future actions ---

--- Step 2: Forecasting overall future physical state ---
[longcontext] Tokens used - Prompt: 
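
The long-context prompts logged in this run (920,496 and 903,916 tokens) sit just under the 1,000,000-token target that `process_forecast_overall_future_state_with_io` and `process_overall_state` pass to `select_posts_by_token_limit`. That helper is defined elsewhere in the notebook, so the sketch below is only an illustrative stand-in showing one way such a budget could be enforced: keep posts in order until the next one would exceed the limit. `count_tokens` is a hypothetical whitespace-based approximation, not the Gemini tokenizer, and `select_within_budget` is likewise a name invented for this example.

```python
def count_tokens(text):
    """Hypothetical stand-in: approximate tokens by whitespace-separated words."""
    return len(text.split())


def select_within_budget(posts, token_limit, counter=count_tokens):
    """Keep posts in order until adding the next one would exceed token_limit."""
    selected, used = [], 0
    for post in posts:
        cost = counter(post)
        if used + cost > token_limit:
            break
        selected.append(post)
        used += cost
    return selected, used


if __name__ == "__main__":
    demo_posts = ["a b c", "d e", "f g h i"]
    chosen, total = select_within_budget(demo_posts, token_limit=5)
    print(chosen, total)
```

A real implementation would presumably count tokens with the model's own tokenizer, since word counts can differ substantially from model token counts.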

In [9]:
selection = 2 # What level of advancement will SpaceX’s space technology reach by 2050?
selected_question = list(question_directory_map.keys())[selection - 1]
data_dir = question_directory_map[selected_question]
topic = selected_question
psychohistory(data_dir, topic, crawl_data=True, filter_data=False, use_existing_sequence=False)


=== Step 1: Predicting future sequence ===
Time points: ['2024-12-01', '2028', '2032', '2036', '2040', '2044', '2050']

=== Step 2: Getting subreddits and news keywords ===

=== Step 3: Crawling data ===




Failed to download data from subreddit 'commercialspaceflight' sorted by 'hot'. Error: received 404 HTTP response
Failed to download data from subreddit 'commercialspaceflight' sorted by 'new'. Error: received 404 HTTP response
Failed to download data from subreddit 'eli5' sorted by 'hot'. Error: received 403 HTTP response
Failed to download data from subreddit 'eli5' sorted by 'new'. Error: received 403 HTTP response

=== Step 4: Combining and selecting data ===
Combined posts saved to tmp/spacex/combined_posts.json

=== Step 5: Loading combined posts from file ===

=== Step 6: Extracting initial individual contributions ===

=== Step 7: Deriving initial overall state ===
[longcontext] Tokens used - Prompt: 167811, Candidates: 2306, Cached Content: 167575
[longcontext] Tokens used - Prompt: 167836, Candidates: 1326, Cached Content: 167575

=== Processing Time Point 1/6 (2028) ===

--- Step 1: Predicting individual future actions ---

--- Step 2: Forecasting overall future physical sta

In [10]:
selection = 3 # What will the political polarization between the two parties in the United States develop into by 2050?
selected_question = list(question_directory_map.keys())[selection - 1]
data_dir = question_directory_map[selected_question]
topic = selected_question
psychohistory(data_dir, topic, crawl_data=True, filter_data=False, use_existing_sequence=False)


=== Step 1: Predicting future sequence ===
Time points: ['2024', '2028', '2033', '2038', '2043', '2048', '2050']

=== Step 2: Getting subreddits and news keywords ===

=== Step 3: Crawling data ===





=== Step 4: Combining and selecting data ===
Combined posts saved to tmp/politics/combined_posts.json

=== Step 5: Loading combined posts from file ===

=== Step 6: Extracting initial individual contributions ===

=== Step 7: Deriving initial overall state ===
[longcontext] Tokens used - Prompt: 196240, Candidates: 924, Cached Content: 195986
[longcontext] Tokens used - Prompt: 196328, Candidates: 709, Cached Content: 195986

=== Processing Time Point 1/6 (2028) ===

--- Step 1: Predicting individual future actions ---

--- Step 2: Forecasting overall future physical state ---
[longcontext] Tokens used - Prompt: 874047, Candidates: 630

--- Step 3: Simulating future individual behavior ---

--- Step 4: Deriving psychological overall state ---
[longcontext] Tokens used - Prompt: 901121, Candidates: 516

=== Processing Time Point 2/6 (2033) ===

--- Step 1: Predicting individual future actions ---

--- Step 2: Forecasting overall future physical state ---
[longcontext] Tokens used - Pro