## Toy data generation

This notebook generates synthetic demographic datasets for testing and analysis. The `random_partition` function is the core engine: it splits a total into a given number of random non-negative integers that sum to that total. The `toy_cencus` function generates population data for geographic areas, producing category counts (such as gender and age groups) in which each group of counts sums to the area's population size. The `toy_survay` function generates individual responses in which each simulated person picks exactly one option per category group, stored with one-hot encoding. Both functions write a CSV file with one header line followed by data rows, giving ready-to-use synthetic data for statistical analysis, machine learning, or data visualization.

In [6]:
function random_partition(length::Int, total::Int)
    # At least one piece is required
    length >= 1 || throw(ArgumentError("length must be at least 1"))

    # Generate (length-1) random cut points between 0 and total, then sort them
    # This is like randomly placing markers along a rope of length 'total'
    cuts = sort(rand(0:total, length - 1))

    # Add both rope ends (0 and total) and take the gaps between neighbours:
    #   first piece   = cuts[1] - 0
    #   middle pieces = cuts[i] - cuts[i-1]
    #   last piece    = total - cuts[end]
    # With length == 1 there are no cuts and the single piece equals 'total'
    return diff(vcat(0, cuts, total))
end


random_partition (generic function with 1 method)

## **How It Works (Rope Metaphor):**

Imagine you have a rope of length `total` that you want to cut into `length` pieces:

1. **Random Cut Points**: Place `length-1` random marks along the rope
2. **Sort Marks**: Arrange the marks in order from smallest to largest  
3. **Measure Pieces**: 
   - First piece: from start to first mark
   - Middle pieces: between consecutive marks
   - Last piece: from last mark to end

**Guaranteed**: All pieces sum to the original rope length! 
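
The three steps can be traced by hand with fixed marks instead of random ones. The sketch below is illustrative only: `pieces_from_cuts` is a hypothetical helper that is not part of the notebook, and the marks `[7, 3]` are chosen by hand.

```julia
# Hypothetical helper (not in the notebook): turn cut marks into piece lengths.
function pieces_from_cuts(cuts::Vector{Int}, total::Int)
    marks = sort(cuts)               # step 2: order the marks
    edges = vcat(0, marks, total)    # add the two ends of the rope
    return diff(edges)               # step 3: gaps between neighbouring edges
end

pieces = pieces_from_cuts([7, 3], 10)   # marks at 3 and 7 on a rope of length 10
println(pieces)                         # [3, 4, 3]
println(sum(pieces) == 10)              # true
```

Because the gaps between consecutive edges telescope from 0 to `total`, the sum does not depend on where the marks fall.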

---

## **Example:**
```julia
# Cut a rope of length 10 into 3 pieces
result = random_partition(3, 10)
# Might produce: [3, 4, 3] → 3 + 4 + 3 = 10 
```

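This is a simple way to generate random partitions into non-negative integers (pieces of size 0 are possible). In general the result is not exactly uniform over all possible partitions when `length` is greater than 2, because the marks are drawn independently and outcomes that need two marks at the same position are less likely.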
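
How the outcomes are distributed can be checked exactly on a tiny case. The sketch below enumerates every equally likely pair of marks for 3 pieces of a rope of length 1; `count_outcomes` is a hypothetical helper written for this check and assumes the same cut-and-sort scheme as `random_partition`.

```julia
# Hypothetical helper (not in the notebook): count how often each 3-piece
# result occurs when both marks range over every value in 0:total.
function count_outcomes(total::Int)
    counts = Dict{Vector{Int}, Int}()
    for c1 in 0:total, c2 in 0:total
        pieces = diff(vcat(0, sort([c1, c2]), total))
        counts[pieces] = get(counts, pieces, 0) + 1
    end
    return counts
end

counts = count_outcomes(1)
for pieces in sort(collect(keys(counts)))
    println(pieces, " occurs in ", counts[pieces], " of 4 draws")
end
# [0, 0, 1] occurs in 1 of 4 draws
# [0, 1, 0] occurs in 2 of 4 draws
# [1, 0, 0] occurs in 1 of 4 draws
```

The middle outcome is twice as likely as the other two, since it is reached by both `(0, 1)` and `(1, 0)` before sorting.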

In [7]:
function toy_cencus(headers::Vector{Vector{String}}, filename::String, number_of_areas::Int)
    # Open the CSV file for writing (creates or overwrites)
    open(filename, "w") do file
        
        # Flatten the nested header structure into a single vector
        # Example: [["m", "f"], ["0-16", "17-21"]] becomes ["m", "f", "0-16", "17-21"]
        header = reduce(vcat, headers)
        
        # Convert the header vector to a CSV line by joining with commas
        csv_line = join(header, ",")
        
        # Write the header line to the file (adds newline character)
        write(file, csv_line * "\n")
        
        # Generate random census data for the specified number of areas
        for _ in 1:number_of_areas
            
            # Randomly generate a population size between 50 and 100 for this area
            population_size = rand(50:100)
            
            # Initialize an empty integer vector to store all the census counts
            vect = Int[]
            
            # For each header group (e.g., genders, age groups)
            for v in headers
                # Generate random partitions that sum to the population_size
                # length(v) determines how many categories in this group
                vect_inner = random_partition(length(v), population_size)
                
                # Append the new partitions to the existing results in place
                append!(vect, vect_inner)
            end
            
            # Convert all numbers to strings for CSV output
            string_vector = string.(vect)  # Broadcast conversion to string
            
            # Create a CSV line by joining the string vector with commas
            csv_line = join(string_vector, ",")  # "23,77,15,32,28,12,13"
            
            # Write the data line to the CSV file
            write(file, csv_line * "\n")
        end    
    end
end

toy_cencus (generic function with 1 method)


## **What This Function Does:** 

1. **Creates a CSV file** with headers from nested string vectors
2. **Generates random census data** for multiple geographic areas
3. **Each area** has a random population size between 50-100
4. **For each header group**, creates random partitions that sum to the area's population
5. **Writes both header and data** in proper CSV format

---

## **Example Output CSV:**
```
m,f,0-16,17-21,21-40,40-66,67+
23,77,15,32,28,12,13
45,55,8,41,22,19,10
...more rows...
```

**Header Structure:** `[["m", "f"], ["0-16", "17-21", "21-40", "40-66", "67+"]]`  
**Data Pattern:** Each row has 2 gender counts + 5 age group counts = 7 numbers  
**Sum Guarantee:** Each group sums to the area's population size 
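
The sum guarantee can be checked on a single row. The following is a minimal sketch that uses the first data row of the example output; `group_sums` is a hypothetical helper, not part of the notebook.

```julia
# Header groups and one data row copied from the example output.
headers = [["m", "f"], ["0-16", "17-21", "21-40", "40-66", "67+"]]
row = "23,77,15,32,28,12,13"

# Hypothetical helper (not in the notebook): sum the counts of each header group.
function group_sums(headers::Vector{Vector{String}}, row::AbstractString)
    counts = parse.(Int, split(row, ","))
    sums = Int[]
    start = 1
    for group in headers
        stop = start + length(group) - 1   # last column of this group
        push!(sums, sum(counts[start:stop]))
        start = stop + 1                   # first column of the next group
    end
    return sums
end

sums = group_sums(headers, row)
println(sums)                        # [100, 100]
println(all(==(sums[1]), sums))      # true
```

The population size itself is not written as a column, so it can only be recovered as the common sum of the groups.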

---


In [8]:
function toy_survay(headers::Vector{Vector{String}}, filename::String, size_of_survay::Int)
    # Open the CSV file for writing (creates or overwrites)
    open(filename, "w") do file
        
        # Flatten the nested header structure into a single vector
        # Example: [["m", "f"], ["0-16", "17-21"]] becomes ["m", "f", "0-16", "17-21"]
        header = reduce(vcat, headers)
        
        # Convert the header vector to a CSV line by joining with commas
        csv_line = join(header, ",")
        
        # Write the header line to the file (adds newline character)
        write(file, csv_line * "\n")
        
        # Generate random survey responses for the specified number of participants
        for _ in 1:size_of_survay
            
            # Initialize an empty integer vector to store all the survey responses
            vect = Int[]
            
            # For each header group (e.g., genders, age groups, questions)
            for v in headers
                # Create a vector of zeros with same length as this header group
                vect_inner = zeros(Int, length(v))
                
                # Randomly select one category in this group and set it to 1
                # This represents a single-choice response (like radio buttons)
                vect_inner[rand(1:length(v))] = 1
                
                # Append the new response to the existing results in place
                append!(vect, vect_inner)
            end
            
            # Convert all numbers to strings for CSV output (0s and 1s)
            string_vector = string.(vect)  # Broadcast conversion to string
            
            # Create a CSV line by joining the string vector with commas
            csv_line = join(string_vector, ",")  # "0,1,1,0,0,0" etc.
            
            # Write the survey response line to the CSV file
            write(file, csv_line * "\n")
        end    
    end
end

toy_survay (generic function with 1 method)

## **What This Function Does:** 

1. **Creates a CSV file** with headers from nested string vectors
2. **Generates random survey responses** for multiple participants
3. **Each participant** provides single-choice responses for each category group
4. **Uses one-hot encoding**: One `1` per group, rest are `0`s
5. **Writes both header and data** in proper CSV format

---

## **Example Output CSV:**
```
m,f,0-16,17-21,21-40,40-66,67+,yes,no
0,1,0,0,1,0,0,1,0
1,0,1,0,0,0,0,0,1
...more rows...
```

**Header Structure:** `[["m", "f"], ["0-16", "17-21", "21-40", "40-66", "67+"], ["yes", "no"]]`  
**Data Pattern:** Each row has one `1` per group, all others `0`  
**Single Choice:** Each group represents exclusive options (like radio buttons)
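
To see how a one-hot row maps back to the chosen options, the sketch below decodes the first data row of the example output. `decode_row` is a hypothetical helper, not part of the notebook.

```julia
# Header groups and one data row copied from the example output.
headers = [["m", "f"], ["0-16", "17-21", "21-40", "40-66", "67+"], ["yes", "no"]]
row = "0,1,0,0,1,0,0,1,0"

# Hypothetical helper (not in the notebook): return the selected label of each group.
function decode_row(headers::Vector{Vector{String}}, row::AbstractString)
    flags = parse.(Int, split(row, ","))
    choices = String[]
    start = 1
    for group in headers
        stop = start + length(group) - 1
        block = flags[start:stop]
        # One-hot encoding means exactly one 1 in every group
        sum(block) == 1 || error("each group must contain exactly one 1")
        push!(choices, group[findfirst(==(1), block)])
        start = stop + 1
    end
    return choices
end

choices = decode_row(headers, row)
println(choices)    # ["f", "21-40", "yes"]
```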

---



In [9]:
mkpath("Data")  # make sure the output directory exists before writing
toy_cencus([["m", "f"], ["0-16", "17-21", "21-40", "40-66", "67+"]], "Data/toy_cencus.csv", 10)
toy_survay([["m", "f"], ["0-16", "17-21", "21-40", "40-66", "67+"]], "Data/toy_survay.csv", 200)