# Supplementary Documentation Building
This notebook generates content from the repo folders (scripts and analytical outputs) for compiling a comprehensive supplementary document. That document summarizes the key analytical and diagnostic results of the various analyses, extending and supporting the presentation in the main paper.

## Libraries
The following libraries were used.

In [5]:
import csv
import os
import pandas as pd

## Summarize the R Script
Here, we summarize the structure of the R script. The script itself is already extensively documented with clear section headings, so we extract only the doc strings (the comment lines) and use them to build the framework for a markdown section of the supplementary document. That section summarizes the analytical workflow in the script, providing clear headers, context, inputs, and outputs for the main script sections.

Extract the analysis pipeline from the doc strings in the R script file:

In [None]:
# Python script to extract comments (lines starting with '#') from an R script
def extract_comments(input_path, output_path):
    try:
        with open(input_path, "r") as infile, open(output_path, "w") as outfile:
            for line in infile:
                stripped_line = line.lstrip()
                if stripped_line.startswith("#"):  # Identify comment lines
                    outfile.write(line)  # Write comment to output file
        print(f"Extraction successful! Comments saved to {output_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Specify the input R script and output text file
input_path = "Src/urban_wealth_scale.R"
output_path = "Src/code_docs_outline.txt"

# Run the extraction
extract_comments(input_path, output_path)

Extraction successful! Comments saved to Src/code_docs_outline.txt
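
As a rough illustration of how the extracted comment lines could become the framework of a markdown section, consider the sketch below. It is not the procedure actually used for the supplement. The helper `outline_to_markdown` is hypothetical, and it assumes that section headings follow the RStudio-style `# Title ----` convention, which the R script may or may not use.

```python
import re

# Assumed heading convention: some text followed by 4+ dashes, equals signs, or hashes.
SECTION_RE = re.compile(r"^(.*?)\s*[-=#]{4,}$")

def outline_to_markdown(comment_lines):
    """Turn R comment lines into a rough markdown outline (hypothetical helper)."""
    blocks = []
    for raw in comment_lines:
        # Drop the leading '#' characters and surrounding whitespace
        text = raw.strip().lstrip("#").strip()
        if not text:
            continue  # bare '#' spacer line
        match = SECTION_RE.match(text)
        if match:
            title = match.group(1)
            if title:  # pure divider lines such as '# --------' are skipped
                blocks.append(f"### {title}")
        else:
            blocks.append(text)  # ordinary comment becomes body text
    return "\n\n".join(blocks)

# Invented sample lines, used only to exercise the function
sample = ["# Load data ----\n", "# Reads the site table.\n", "#\n", "# Fit model ----\n"]
print(outline_to_markdown(sample))
```

In practice the lines would come from `Src/code_docs_outline.txt`, and the heading pattern would need adjusting to whatever convention the script really follows.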


## Output File Table
To make it easier for reviewers and readers to navigate the project repo, we will construct a summary table of the analytical outputs produced by the R script. There are around 190 output files, comprising both numerical and plot outputs. Here we will just grab the filenames from the Output folder and write them to a csv for further automated and manual processing.

Get all the output filenames for processing:

In [None]:
# Define the directory and output CSV filename
output_folder = "../Output/"
csv_filename = "output_filenames.csv"

# Get all filenames in the Output folder
file_list = os.listdir(output_folder)

# Write filenames to a CSV file
with open(csv_filename, mode='w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    # Write the header
    csv_writer.writerow(["Filename"])
    # Write each filename as a row
    for filename in file_list:
        csv_writer.writerow([filename])

print(f"Filenames from '{output_folder}' written to '{csv_filename}'")

Filenames from 'Output/' written to 'Output/output_filenames.csv'
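
One caveat with `os.listdir` is that it returns entries in arbitrary order and includes subfolders. If a stable, files-only listing is preferred, something like the following sketch could be used instead. The helper name `list_output_files` is hypothetical, and the temporary folder and its contents are invented purely so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def list_output_files(folder):
    """Return the names of regular files in `folder`, sorted alphabetically."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())

# Self-contained demo with stand-in files (not the real Output folder)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_table.csv").write_text("x")
    Path(tmp, "a_plot.png").write_bytes(b"")
    Path(tmp, "subfolder").mkdir()  # directories are excluded from the listing
    listed = list_output_files(tmp)

print(listed)
```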


Now, we need to take in the csv of filenames, add context based on the filename conventions used for the analysis, sort the files, and create a markdown table based on the sorted filenames with metadata:

In [None]:
# Load the CSV file containing the filenames
file_path = "output_table_file_map.csv"  # Update this path to your actual file location
df = pd.read_csv(file_path)

# Define the desired order for 'Type' and 'Analysis' columns
type_order = ['Summary', 'Plot', 'Numeric']
analysis_order = ['Main', 'Supplemental']
context_order = ['Summary', 'Model Diagnostic', 'MCMC Diagnostic']

# Replace NaNs in 'Type' column with empty string to process them
df['Type'] = df['Type'].fillna('')

# Assign 'Numeric' type to any rows with filenames ending in .csv
df.loc[df['Filename'].str.endswith('.csv'), 'Type'] = 'Numeric'

# Add the 'Context' column and initialize with an empty string
df['Context'] = ''

# Fill 'Context' column based on rules
df.loc[(df['Model'].str.contains('all', case=False, na=False)) | 
       (df['Filename'].str.contains('summary', case=False, na=False)), 'Context'] = 'Summary'

df.loc[df['Filename'].str.contains('tplots|geweke|grrhat', case=False, na=False), 'Context'] = 'MCMC Diagnostic'

df.loc[df['Filename'].str.contains('lppd|loo|resid|outlier', case=False, na=False), 'Context'] = 'Model Diagnostic'

# Convert the columns to categorical with the specified order
df['Analysis'] = pd.Categorical(df['Analysis'], categories=analysis_order, ordered=True)
df['Type'] = pd.Categorical(df['Type'], categories=type_order, ordered=True)
df['Context'] = pd.Categorical(df['Context'], categories=context_order, ordered=True)

# Sort the dataframe by 'Analysis', then 'Type', then 'Context'
sorted_df = df.sort_values(by=['Analysis', 'Type', 'Context'])

# Save the sorted dataframe back to a CSV (or modify this to write Markdown if needed)
output_csv_path = "sorted_filenames_metadata.csv"

if os.path.exists(output_csv_path):
    os.remove(output_csv_path)

sorted_df.to_csv(output_csv_path, index=False)

# Generate the markdown table
output_path = "sorted_filenames_metadata.md"

if os.path.exists(output_path):
    os.remove(output_path)

with open(output_path, "w") as f:
    # Write the table header
    f.write("| Analysis | Type | Context | Model | Script Section | Filename |\n")
    f.write("|----------|------|---------|-------|----------------|----------|\n")
    
    # Write each row as a markdown table row
    for _, row in sorted_df.iterrows():
        f.write(f"| {row['Analysis']} | {row['Type']} | {row['Context']} | {row['Model']} | {row['Script Section']} | {row['Filename']} |\n")

sorted_df
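
The Context rules in the cell above are applied in sequence, so a later rule overwrites an earlier one when a filename matches both. The sketch below restates that logic in plain Python to make the precedence and the custom sort order easier to see. It is an illustration only: `classify_context` and `rank` are hypothetical helpers, and the `Model`, `Analysis`, and `Type` values in the sample records are assumed for the sake of the example.

```python
import re

ANALYSIS_ORDER = ["Main", "Supplemental"]
TYPE_ORDER = ["Summary", "Plot", "Numeric"]
CONTEXT_ORDER = ["Summary", "Model Diagnostic", "MCMC Diagnostic"]

def classify_context(filename, model):
    """Mirror the Context rules; later checks win, as in the pandas cell."""
    context = ""
    if re.search("all", model, re.I) or re.search("summary", filename, re.I):
        context = "Summary"
    if re.search("tplots|geweke|grrhat", filename, re.I):
        context = "MCMC Diagnostic"
    if re.search("lppd|loo|resid|outlier", filename, re.I):
        context = "Model Diagnostic"
    return context

def rank(value, order):
    """Position of `value` in `order`; unknown values sort last."""
    return order.index(value) if value in order else len(order)

# Sample records: Model/Analysis/Type values are assumptions for illustration
records = [
    {"Filename": "geweke_histogram.png", "Model": "all", "Analysis": "Main", "Type": "Plot"},
    {"Filename": "resid_allwalls.png", "Model": "allwalls", "Analysis": "Main", "Type": "Plot"},
    {"Filename": "point_scatters.png", "Model": "", "Analysis": "Main", "Type": "Plot"},
]
for rec in records:
    rec["Context"] = classify_context(rec["Filename"], rec["Model"])

# Sort by Analysis, then Type, then Context, each in its custom order
records.sort(key=lambda r: (rank(r["Analysis"], ANALYSIS_ORDER),
                            rank(r["Type"], TYPE_ORDER),
                            rank(r["Context"], CONTEXT_ORDER)))
for rec in records:
    print(rec["Filename"], "->", rec["Context"] or "(none)")
```

Note that `resid_allwalls.png` first matches the Summary rule (its assumed model name contains "all") and is then reclassified as a Model Diagnostic, which is the same override behaviour the pandas version produces.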

After sorting and adding context/metadata, the filenames are stored in `sorted_filenames_metadata.csv`, which is then used to build a series of markdown tables describing all of the files (the analytical products) in the Output folder. The tables are divided into main, supplemental, numeric, and plot tables, which turned out to be necessary in order to display the content easily in a simple markdown document without having to tweak the underlying LaTeX.

In [None]:
# Load the pre-sorted CSV file
file_path = "sorted_filenames_metadata.csv" 
df = pd.read_csv(file_path)

def save_grouped_table(df_subset, filename):
    with open(filename, "w") as f:
        # Write the table header
        f.write("| Context | Script Section | Filename |\n")
        f.write("|---------|----------------|----------|\n")

        # Initialize tracking for the current group
        current_group = None

        for _, row in df_subset.iterrows():
            # Extract the group name (substring before the first `_`)
            split_filename = row['Filename'].split('_', 1)
            group_name = split_filename[0]
            
            # Check if a new group starts
            if group_name != current_group:
                # Write a group header row with the trailing underscore and ellipsis
                f.write(f"| | | **{group_name}_...** |\n")  # Empty Context and Script Section
                current_group = group_name

            # Write the filename row, removing the group prefix
            cleaned_filename = split_filename[1] if len(split_filename) > 1 else split_filename[0]
            f.write(f"| {row['Context']} | {row['Script Section']} | {cleaned_filename} |\n")

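# Build the four subsets of the DataFrame.
# Assumption: subsets are defined by the 'Analysis' and 'Type' labels assigned in the previous cell.
main_plot = df[(df['Analysis'] == 'Main') & (df['Type'] == 'Plot')]
main_numeric = df[(df['Analysis'] == 'Main') & (df['Type'] == 'Numeric')]
supplemental_plot = df[(df['Analysis'] == 'Supplemental') & (df['Type'] == 'Plot')]
supplemental_numeric = df[(df['Analysis'] == 'Supplemental') & (df['Type'] == 'Numeric')]

# Write one grouped markdown table per subset
save_grouped_table(main_plot, "grouped_main_plot.md")
save_grouped_table(main_numeric, "grouped_main_numeric.md")
save_grouped_table(supplemental_plot, "grouped_supplemental_plot.md")
save_grouped_table(supplemental_numeric, "grouped_supplemental_numeric.md")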
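
The grouping in `save_grouped_table` relies on the text before the first underscore and only starts a new group when that prefix changes between consecutive rows. The sketch below isolates this step with `itertools.groupby`, using a few filenames from the Output folder. The helper `group_by_prefix` is hypothetical and is meant only to show the behaviour, not to replace the function above.

```python
from itertools import groupby

def group_by_prefix(filenames):
    """Group consecutive filenames by the text before the first underscore."""
    def prefix(name):
        return name.split("_", 1)[0]
    # Within each group, keep only the part after the prefix (or the whole
    # name if it contains no underscore), as save_grouped_table does.
    return [(key, [name.split("_", 1)[-1] for name in names])
            for key, names in groupby(filenames, key=prefix)]

names = [
    "geweke_histogram.png",
    "point_scatters.png",
    "point_scatters_linear.png",
    "resid_allwalls.png",
]
groups = group_by_prefix(names)
for key, members in groups:
    print(f"{key}_... : {members}")
```

Because only consecutive rows are compared, rows sharing a prefix need to be adjacent in the input; otherwise the same group header would be written more than once.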

Compress the PNGs so that the final supplement.pdf isn't so big.

In [7]:
from PIL import Image
from pathlib import Path

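def compress_png_pillow(input_dir, output_dir, resize_factor=0.5, quality=85):
    """
    Compress PNG files using Pillow by resizing and optimizing.

    Note: PNG is a lossless format, so Pillow's PNG writer ignores `quality`;
    the size reduction comes from the resize and from `optimize=True`.
    """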
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)  # Create output directory if it doesn't exist

    for png_file in input_dir.glob("*.png"):
        with Image.open(png_file) as img:  # Ensure file is properly closed after processing
            # Resize image
            new_dimensions = (
                int(img.width * resize_factor),
                int(img.height * resize_factor)
            )
            img_resized = img.resize(new_dimensions, Image.Resampling.LANCZOS)

            # Save compressed image
            compressed_file = output_dir / png_file.name
            img_resized.save(compressed_file, optimize=True, quality=quality)
            print(f"Compressed {png_file} to {compressed_file}")

# Run the function
compress_png_pillow("../Output", "compressed_images", 0.5, 60)



Compressed ..\Output\geweke_histogram.png to compressed_images\geweke_histogram.png
Compressed ..\Output\grrhat_histogram.png to compressed_images\grrhat_histogram.png
Compressed ..\Output\pa_gini_results.png to compressed_images\pa_gini_results.png
Compressed ..\Output\point_scatters.png to compressed_images\point_scatters.png
Compressed ..\Output\point_scatters_linear.png to compressed_images\point_scatters_linear.png
Compressed ..\Output\point_scatters_linear_log.png to compressed_images\point_scatters_linear_log.png
Compressed ..\Output\resid_allmonuments.png to compressed_images\resid_allmonuments.png
Compressed ..\Output\resid_allmonuments_linlog.png to compressed_images\resid_allmonuments_linlog.png
Compressed ..\Output\resid_allwalls.png to compressed_images\resid_allwalls.png
Compressed ..\Output\resid_allwalls_linlog.png to compressed_images\resid_allwalls_linlog.png
Compressed ..\Output\resid_epigraphy.png to compressed_images\resid_epigraphy.png
Compressed ..\Output\resid_e
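
To check how much the compression step actually saves, the original and compressed file sizes can be compared. The following is a minimal sketch under stated assumptions: `size_report` is a hypothetical helper, and the demo writes stand-in byte strings to a temporary folder rather than reading the real `../Output` and `compressed_images` folders.

```python
import tempfile
from pathlib import Path

def size_report(original_dir, compressed_dir):
    """Return {filename: (original_bytes, compressed_bytes)} for PNGs present in both folders."""
    original_dir, compressed_dir = Path(original_dir), Path(compressed_dir)
    report = {}
    for original in sorted(original_dir.glob("*.png")):
        compressed = compressed_dir / original.name
        if compressed.exists():
            report[original.name] = (original.stat().st_size, compressed.stat().st_size)
    return report

# Self-contained demo: the files hold placeholder bytes, not real images
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "Output"), Path(tmp, "compressed_images")
    src.mkdir()
    dst.mkdir()
    (src / "demo.png").write_bytes(b"x" * 1000)
    (dst / "demo.png").write_bytes(b"x" * 250)
    demo_report = size_report(src, dst)

for name, (before, after) in demo_report.items():
    print(f"{name}: {before} -> {after} bytes")
```

For the real folders, the call would be `size_report("../Output", "compressed_images")`.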

## Trace Plots
We now create a traceplot section for the supplement by adding the traceplot PNG files to a separate markdown document that will be included in the main supplement.

Everything that follows happens within the Supplement subfolder of the primary project repo. Create a markdown file containing all the trace plots that can later be included in the main supplement file.

In [6]:
# Read the metadata table
metadata_path = "sorted_filenames_metadata.csv"
metadata = pd.read_csv(metadata_path)

# Filter for `tplots_...png` files and maintain order from metadata
tplots_files = metadata[metadata['Filename'].str.contains('tplots_') & metadata['Filename'].str.endswith('.png')]

# Generate the Markdown content
md_content = "# Trace Plots\n\n"
md_content += "This section contains the trace plots (`tplots_...png`) for the MCMC diagnostics.\n\n"

for _, row in tplots_files.iterrows():
    filename = row['Filename']
    relative_path = f"compressed_images/{filename}"
    md_content += f"### {filename}\n"
    md_content += f"![{filename}]({relative_path})\n\n"

# Write the Markdown content to a new file
output_md_path = "tplots_section.md"

if os.path.exists(output_md_path):
    os.remove(output_md_path)

with open(output_md_path, "w") as md_file:
    md_file.write(md_content)

print(f"Markdown section for tplots created at: {output_md_path}")


Markdown section for tplots created at: tplots_section.md
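
The generated section links to images under `compressed_images/`, so it only renders correctly if the compression step has already produced those files. A quick check along the following lines could catch broken links before the supplement is compiled. This is a sketch: `missing_images` is a hypothetical helper, and the filenames in the demo are invented.

```python
import re
import tempfile
from pathlib import Path

# Matches markdown image links of the form ![alt](path) and captures the path
IMAGE_LINK_RE = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")

def missing_images(markdown_text, base_dir="."):
    """Return image paths referenced in the markdown that do not exist under base_dir."""
    paths = IMAGE_LINK_RE.findall(markdown_text)
    return [p for p in paths if not (Path(base_dir) / p).exists()]

# Self-contained demo: one linked image exists, the other does not
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp, "compressed_images")
    images.mkdir()
    (images / "tplots_a.png").write_bytes(b"")
    md = ("![tplots_a.png](compressed_images/tplots_a.png)\n\n"
          "![tplots_b.png](compressed_images/tplots_b.png)\n")
    missing = missing_images(md, base_dir=tmp)

print(missing)
```

Applied to the real file, the text would be read from `tplots_section.md` and checked against the Supplement folder.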


## Summary of Parameter Posteriors
The analysis produced hundreds of parameter posterior estimates across the various models and supplemental analyses. We summarize those here with another markdown table, generated by reading the posterior summaries produced by the R script (as CSVs) and formatting the data into a markdown table.

Create markdown summary table of consolidated model posterior summaries for key scaling variables, which will also be included in the main supplement file.

In [None]:
# Define the output directory containing the "post_summary..." files
output_dir = "../Output"
markdown_file = "consolidated_post_summary.md"

# Get all the "post_summary..." files
post_summary_files = [
    f for f in os.listdir(output_dir) if f.startswith("post_summary") and f.endswith(".csv")
]

# Prepare a consolidated DataFrame
consolidated_df = pd.DataFrame()

# Loop through each file and extract the relevant rows
for file in post_summary_files:
    file_path = os.path.join(output_dir, file)
    df = pd.read_csv(file_path)
    
    df.rename(columns={df.columns[0]: "param"}, inplace=True)

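    # Extract the relevant rows; .copy() makes an independent DataFrame so the
    # column assignment below does not trigger pandas' SettingWithCopyWarning
    filtered_df = df[df['param'].isin(['b0', 'b1', 'intercept', 'scaling'])].copy()
    
    # Add a new column for the model/analysis (taken from the file name)
    filtered_df['model'] = file.replace("post_summary_", "").replace(".csv", "")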
    
    # Append to the consolidated DataFrame
    consolidated_df = pd.concat([consolidated_df, filtered_df], ignore_index=True)

# Rearrange the columns
consolidated_df = consolidated_df[['model', 'param', 'lower', 'upper', 'mean', 'stdd']]

# Create the markdown table
markdown_table = consolidated_df.to_markdown(index=False)

# Write the markdown to the file
with open(markdown_file, "w") as f:
    f.write("# Consolidated Parameter Summary\n\n")
    f.write(markdown_table)

print(f"Markdown section written to: {markdown_file}")


Markdown section written to: consolidated_post_summary.md


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['model'] = file.replace("post_summary_", "").replace(".csv", "")
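
`DataFrame.to_markdown` depends on the optional `tabulate` package. Where that package is unavailable, a pipe-delimited table can be assembled by hand, as in the sketch below. The helper `rows_to_markdown` is hypothetical, and the numbers in the demo row are placeholders for illustration only, not results from the analysis.

```python
def rows_to_markdown(columns, rows):
    """Format rows (a list of dicts) as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines)

columns = ["model", "param", "lower", "upper", "mean", "stdd"]
# Placeholder values, NOT output of the R script
rows = [{"model": "demo", "param": "b1", "lower": 0.1, "upper": 0.3, "mean": 0.2, "stdd": 0.05}]
table = rows_to_markdown(columns, rows)
print(table)
```

With the consolidated DataFrame from the cell above, the rows could be obtained with `consolidated_df.to_dict("records")`.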