## Step 2: Encoding

### Why This Step?

Our dataset mixes numerical and categorical variables (binary responses, ordinal scales, nominal categories, and multi-value fields), and the categorical ones must be converted to numbers before imputation. The encoding has to preserve each variable's meaning, structure, and relationships, so that imputation does not introduce bias or false assumptions.

### Key Objectives

- **Numerical Conversion**: Transform all categorical data into a machine-readable numerical format.
- **Preserve Meaning**: Maintain the inherent properties of the data (e.g., order in ordinal variables, independence in nominal variables).
- **Support Imputation**: Retain missing values (NaNs) for later imputation while ensuring the encoded data is interpretable by machine learning models.
- **Handle Complexity**: Address multi-value fields and diverse variable types appropriately.

### Methodology

We applied a tailored encoding strategy to accommodate the diverse variable types in the dataset. Each method was chosen based on the nature of the variables and their intended use in imputation. Below is a detailed breakdown of the encoding approaches:

#### 1. Numerical Columns
- **Columns**: `Age`, `Nb enfants`, `TAS`, `TAD`, `Poids`, `Taille`, etc.
- **Method**: Kept as numeric values and converted to float with `pd.to_numeric(..., errors='coerce')`, so missing entries stay NaN and any non-numeric entry is also coerced to NaN.
- **Rationale**: These variables are already numerical and require no transformation beyond ensuring a consistent data type that preserves missing values.
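
As a small illustration of the conversion described above, the sketch below runs `pd.to_numeric` with `errors="coerce"` on a made-up `Age` sample. The values are hypothetical and not taken from the dataset. It also counts how many entries were turned into NaN by coercion, which can help spot typos that would otherwise be treated as ordinary missing values.

```python
import pandas as pd

# Hypothetical sample: a number, a numeric string, a stray label, and a missing entry
raw = pd.Series([54, "61", "inconnu", None], name="Age")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
age = pd.to_numeric(raw, errors="coerce").astype(float)

# Entries that were present in the raw data but became NaN after coercion
n_coerced = int(age.isna().sum() - raw.isna().sum())

print(age.tolist())
print(f"{n_coerced} value(s) coerced to NaN")
```

A non-zero coerced count is a hint to inspect the raw column before imputation, since the imputer cannot tell a typo from a true gap.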

#### 2. Binary Columns
- **Columns**: `Neffa`, `Fumées de Tabouna`, `AT en milieu agricole`, `Ménopause`
- **Method**: Mapped `'oui'` → `1.0`, `'non'` → `0.0`, with NaNs preserved as-is.
  - **Special Case**: `Tabagisme` mapped as `'non'` → `0.0`, `'passif'` → `1.0`, `'oui'` → `2.0` to reflect its three distinct categories.
- **Rationale**: Yes/no responses map directly to `0.0`/`1.0`. `Tabagisme` has three levels, so it is mapped to an ordered scale (`'non'` < `'passif'` < `'oui'`) rather than forced into two values. Because the mapping leaves NaNs untouched, missing responses remain available for imputation.
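
The mapping above can be sketched with `Series.map`, which returns NaN both for missing inputs and for labels absent from the dictionary. The sample responses below are hypothetical. The normalization step (strip and lowercase) is an assumption added for illustration, in case the labels are not already standardized.

```python
import numpy as np
import pandas as pd

# Hypothetical responses with inconsistent case and whitespace
tabagisme = pd.Series(["Non", " passif", "oui ", np.nan, "non"])

tabagisme_map = {"non": 0.0, "passif": 1.0, "oui": 2.0}

# Normalize text first; the .str methods leave NaN untouched
cleaned = tabagisme.str.strip().str.lower()

# Series.map yields NaN for missing inputs and for labels not in the dict
encoded = cleaned.map(tabagisme_map)

print(encoded.tolist())
```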

#### 3. Ordinal Columns
- **Columns**: `Masque pour pesticides`, `Bottes`, `Gants`, `Casquette/Mdhalla`, `Manteau imperméable`
- **Method**: Mapped with an ordered scale:
  - `'jamais'` → `0.0`, `'parfois'` → `1.0`, `'souvent'` → `2.0`, `'toujours'` → `3.0`.
  - **Additional Column**: `Catégorie professionnelle` with a custom ordinal mapping (e.g., `'agricultrice indépendante'` → `0.0`, `'ouvrière'` → `1.0`).
- **Rationale**: Ordinal encoding reflects the natural progression in frequency or status, which is essential for imputation models to interpret relative differences accurately.
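
One caveat with dictionary-based ordinal encoding is that a label missing from the mapping silently becomes NaN and is then indistinguishable from a true missing answer. The sketch below uses a hypothetical helper, `encode_ordinal`, and a made-up `Gants` sample to report such labels alongside the encoded column. It is an optional safeguard, not part of the accompanying code.

```python
import numpy as np
import pandas as pd

ordinal_map = {"jamais": 0.0, "parfois": 1.0, "souvent": 2.0, "toujours": 3.0}

def encode_ordinal(series, mapping):
    """Map labels to codes and report labels that the mapping does not cover."""
    encoded = series.map(mapping)
    # Present before mapping but NaN afterwards means the label was not in the mapping
    unmapped_mask = series.notna() & encoded.isna()
    unmapped_labels = sorted(series[unmapped_mask].unique())
    return encoded, unmapped_labels

# Hypothetical column with one typo ("souvant") and one missing answer
gants = pd.Series(["jamais", "souvant", np.nan, "toujours"])
encoded, unmapped = encode_ordinal(gants, ordinal_map)

print(encoded.tolist())
print("Unmapped labels:", unmapped)
```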

#### 4. Categorical Columns with Natural Order
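- **Columns**: `Situation maritale`, `Domicile`, `Niveau socio-économique`, `Statut`, `Niveau scolaire`
- **Method**: Applied explicit ordinal mappings, with NaNs preserved:
  - `Situation maritale`: `'célibataire'` → `0.0`, `'mariée'` → `1.0`, `'divorcée'` → `2.0`, `'veuve'` → `3.0`
  - `Domicile`: `'monastir'` → `0.0`, `'sfax'` → `1.0`, `'mahdia'` → `2.0`
  - `Niveau socio-économique`: `'bas'` → `0.0`, `'moyen'` → `1.0`, `'bon'` → `2.0`
  - `Statut`: `'permanente'` → `0.0`, `'saisonnière'` → `1.0`
  - `Niveau scolaire`: `'analphabète'` → `0.0`, `'primaire'` → `1.0`, `'secondaire'` → `2.0`, `'supérieur'` → `3.0`
  - **Fallback**: If a column contains a value that its mapping does not cover, the code one-hot encodes that column instead.
- **Rationale**: `Niveau scolaire` and `Niveau socio-économique` have a logical hierarchy, and ordinal encoding preserves this order, giving machine learning models meaningful numerical relationships. For `Situation maritale`, `Domicile`, and `Statut`, the numeric order is a coding convention rather than a ranking and should be interpreted with that in mind.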
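
An alternative way to make the intended order explicit, not the approach used in the accompanying code, is pandas' ordered `Categorical`. The sketch below encodes a hypothetical `Niveau scolaire` sample this way. Note that `.codes` marks missing or unknown labels with `-1`, so that sentinel is converted back to NaN to keep missingness visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of education levels, including a missing entry
niveau = pd.Series(["primaire", "analphabète", np.nan, "supérieur"])

# The list order defines the ranking: position 0 is the lowest level
levels = ["analphabète", "primaire", "secondaire", "supérieur"]
cat = pd.Categorical(niveau, categories=levels, ordered=True)

# .codes uses -1 for missing or unknown labels; restore NaN for those rows
codes = pd.Series(cat.codes, index=niveau.index).astype(float).replace(-1.0, np.nan)

print(codes.tolist())
```

With this form the ordering lives in a single list, which makes it harder for the codes and the documented hierarchy to drift apart.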

#### 5. Multi-Value Fields
- **Columns**: `Produits chimiques utilisés`, `Moyen de transport`, `Engrais utilisés`, etc.
- **Method**: Split comma-separated values into individual binary indicators (e.g., `Chemical_pesticides: 1.0` if present, `0.0` if absent, NaN if original value is missing).
- **Rationale**: This approach captures the presence or absence of each item independently, avoiding any implied order and allowing the imputation model to treat each indicator as a separate feature while preserving missingness.
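
The multi-value expansion can be sketched as follows. The helper names `split_items` and `multi_value_indicators` and the sample transport values are hypothetical. Each token is stripped and lowercased before comparison, so `"bus, voiture"` and `"Voiture"` contribute to the same indicator, and rows whose original value is missing get NaN in every indicator.

```python
import numpy as np
import pandas as pd

def split_items(cell):
    """Return a list of cleaned items, or None when the cell is missing."""
    if pd.isna(cell):
        return None
    return [item.strip().lower() for item in str(cell).split(",") if item.strip()]

def multi_value_indicators(series, prefix):
    """Expand a comma-separated column into 0.0/1.0 indicators, NaN where missing."""
    parsed = [split_items(cell) for cell in series]
    values = sorted({item for items in parsed if items is not None for item in items})
    data = {
        f"{prefix}_{value}".replace(" ", "_"): [
            np.nan if items is None else float(value in items) for items in parsed
        ]
        for value in values
    }
    return pd.DataFrame(data, index=series.index)

# Hypothetical values for 'Moyen de transport'
transport = pd.Series(["bus, voiture", "Voiture", np.nan])
indicators = multi_value_indicators(transport, "Transport")

print(indicators)
```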

#### 6. Nominal Categorical Columns
- **Columns**: `Profession du mari`
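- **Method**: Values were first lowercased and stripped of whitespace, and the placeholder `'0'` was treated as missing. One-hot encoding was then applied, creating a binary column for each category (e.g., `Profession du mari_teacher: 1` if true, `0` if false), with a dedicated NaN indicator (e.g., `Profession du mari_nan`).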
- **Rationale**: One-hot encoding prevents the introduction of false ordinal relationships in nominal data, and the NaN indicator explicitly tracks missing values, ensuring they are not lost during encoding.
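
A minimal sketch of this step with `pd.get_dummies` and `dummy_na=True` is shown below. The occupations are hypothetical placeholders, since the real categories come from the dataset. The final check illustrates that each row activates exactly one column, so a missing value is carried by the `_nan` indicator rather than by NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical occupations; the real categories come from the dataset
df = pd.DataFrame({"Profession du mari": ["ouvrier", "agriculteur", np.nan, "ouvrier"]})

# dummy_na=True adds a 'Profession du mari_nan' column that flags missing rows
dummies = pd.get_dummies(df, columns=["Profession du mari"], dummy_na=True, dtype=int)

print(dummies.columns.tolist())

# Each row has exactly one active column, including rows with a missing value
assert (dummies.sum(axis=1) == 1).all()
```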

### Key Considerations

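- **Preserving Missing Values**: No values were imputed at this stage. NaNs were kept in the numerical, binary, ordinal, and multi-value indicator columns; for one-hot encoded columns, missingness is recorded in a dedicated `_nan` indicator instead. This preserves the integrity of the dataset for the subsequent machine learning-based imputation step.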
- **Data Integrity**: Encoding methods were selected to match the variable type (e.g., ordinal for ordered data, one-hot for unordered data), avoiding misrepresentation of relationships.
- **Standardization**: Multi-value fields were lowercased, stripped of surrounding whitespace, and split on commas. Spaces, hyphens, and slashes in the generated indicator names were replaced with underscores to prevent duplicate columns and keep names uniform.
- **Alignment with Code**: The encoding descriptions here directly correspond to the implementation in the accompanying Python code, ensuring consistency between documentation and execution.
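
One way to audit the missing-value consideration after encoding is to compare NaN patterns column by column. The sketch below is an optional check under stated assumptions: `missingness_changed` is a hypothetical helper, and the two small frames are made-up data in which one label (`"tjrs"`) is not covered by its mapping.

```python
import numpy as np
import pandas as pd

def missingness_changed(original, encoded, columns):
    """Return the columns whose NaN pattern differs between the two frames."""
    return [
        col for col in columns
        if not original[col].isna().equals(encoded[col].isna())
    ]

# Hypothetical before/after frames; "tjrs" is not covered by the mapping below
original = pd.DataFrame({
    "Neffa": ["oui", np.nan, "non"],
    "Gants": ["jamais", "tjrs", np.nan],
})
encoded = pd.DataFrame({
    "Neffa": original["Neffa"].map({"oui": 1.0, "non": 0.0}),
    "Gants": original["Gants"].map({"jamais": 0.0, "toujours": 3.0}),
})

changed = missingness_changed(original, encoded, ["Neffa", "Gants"])
print(changed)
```

A column appearing in the result means encoding created or removed missing values there, which is worth resolving before imputation.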

### Outcome

The encoding process transformed the dataset into a fully numerical format suitable for machine learning-based imputation. The final encoded dataset includes:

- **80 observations** (rows)
- **76 features** (columns, expanded due to multi-value and one-hot encoding)

Because each variable type is encoded according to its structure, the dataset keeps its original meaning while being ready for imputation.


In [7]:
import pandas as pd
import numpy as np

# Step 1: Load the Data
input_data = pd.read_excel('standardized_female_farmers_data_no_text.xlsx')
encoded_data = input_data.copy()

# Step 2: Preprocess 'Profession du mari'
if 'Profession du mari' in encoded_data.columns:
    # Standardize: lowercase and strip whitespace
    encoded_data['Profession du mari'] = encoded_data['Profession du mari'].str.lower().str.strip()
    # Replace '0' with NaN to treat it as a missing value
    encoded_data['Profession du mari'] = encoded_data['Profession du mari'].replace('0', np.nan)
    print("Standardized and replaced '0' with NaN in 'Profession du mari'")

# Step 3: Proceed with Encoding

## 3.1 Define Column Categories
# Get available columns from the dataset
available_columns = encoded_data.columns.tolist()

# Numerical columns (will be float to handle NaN)
numerical_cols = [col for col in ['N°', 'Age', 'Nb enfants', 'Nb pers à charge', 'H travail / jour', 
                                  'Age ménopause', 'Ancienneté agricole', 'J travail / Sem', 
                                  'Poids', 'Taille', 'TAS', 'TAD', 'GAD'] 
                  if col in available_columns]

# Binary columns (yes/no)
binary_cols = [col for col in ['Neffa', 'Fumées de Tabouna', 'AT en milieu agricole', 'Ménopause'] 
               if col in available_columns]

# Ordinal columns for equipment usage
ordinal_cols = [col for col in ['Masque pour pesticides', 'Bottes', 'Gants', 'Casquette/Mdhalla', 
                                'Manteau imperméable'] if col in available_columns]

# Ordinal categorical columns
ordinal_categorical_cols = [col for col in ['Situation maritale', 'Domicile', 'Niveau socio-économique',
                                            'Statut', 'Niveau scolaire'] if col in available_columns]

# Multi-value columns (comma-separated values)
multi_value_cols = [col for col in ['Produits chimiques utilisés', 'Produits biologiques utilisés', 
                                    'Engrais utilisés', 'Contraintes thermiques', 'Moyen de transport'] 
                    if col in available_columns]

# Columns to one-hot encode (includes 'Profession du mari')
one_hot_encode_cols = ['Profession du mari'] if 'Profession du mari' in available_columns else []

## 3.2 Binary Encoding
# Map "oui" to 1.0, "non" to 0.0, preserving NaN as float
for col in binary_cols:
    encoded_data[col] = encoded_data[col].map({'oui': 1.0, 'non': 0.0})

## 3.3 Special Encoding for 'Tabagisme'
# Map "non" → 0.0, "passif" → 1.0, "oui" → 2.0, preserving NaN
if 'Tabagisme' in available_columns:
    tabagisme_map = {'non': 0.0, 'passif': 1.0, 'oui': 2.0}
    encoded_data['Tabagisme'] = encoded_data['Tabagisme'].map(tabagisme_map)
    print(f"Tabagisme encoded as: {tabagisme_map}")

## 3.4 Ordinal Encoding for Equipment Usage
# Map "jamais" → 0.0, "parfois" → 1.0, "souvent" → 2.0, "toujours" → 3.0, preserving NaN
ordinal_map = {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
for col in ordinal_cols:
    encoded_data[col] = encoded_data[col].map(ordinal_map)
    print(f"{col} encoded with mapping: {ordinal_map}")

## 3.5 Ordinal Encoding for Categorical Columns
ordinal_mappings = {
    'Situation maritale': {'célibataire': 0.0, 'mariée': 1.0, 'divorcée': 2.0, 'veuve': 3.0},
    'Domicile': {'monastir': 0.0, 'sfax': 1.0, 'mahdia': 2.0},
    'Niveau socio-économique': {'bas': 0.0, 'moyen': 1.0, 'bon': 2.0},
    'Statut': {'permanente': 0.0, 'saisonnière': 1.0},
    'Niveau scolaire': {'analphabète': 0.0, 'primaire': 1.0, 'secondaire': 2.0, 'supérieur': 3.0}
}

for col in ordinal_categorical_cols[:]:  # Use a copy to allow modification
    mapping = ordinal_mappings.get(col, {})
    if mapping:
        # Check if all values can be mapped
        unmappable = encoded_data[col].apply(lambda x: x not in mapping and pd.notna(x)).any()
        if not unmappable:
            encoded_data[col] = encoded_data[col].map(mapping)
            print(f"{col} encoded with mapping: {mapping}")
        else:
            one_hot_encode_cols.append(col)
            ordinal_categorical_cols.remove(col)
            print(f"{col} has unmappable values and will be one-hot encoded instead")
    else:
        one_hot_encode_cols.append(col)
        ordinal_categorical_cols.remove(col)
        print(f"No mapping defined for {col}, will be one-hot encoded")

## 3.6 Process Multi-Value Columns
# Create binary indicators with NaN preservation
prefixes = {
    'Produits chimiques utilisés': 'Chemical',
    'Produits biologiques utilisés': 'Bio',
    'Engrais utilisés': 'Fertilizer',
    'Contraintes thermiques': 'Thermal',
    'Moyen de transport': 'Transport'
}

for col in multi_value_cols:
    if col in available_columns:
        prefix = prefixes.get(col, col.replace(' ', ''))
        encoded_data[col] = encoded_data[col].astype(str).str.lower().str.strip()
        encoded_data[col] = encoded_data[col].replace('nan', np.nan)
        
        # Extract unique values, excluding NaN and empty tokens (e.g. from a trailing comma)
        all_values = encoded_data[col].dropna().str.split(',').explode().str.strip()
        unique_values = sorted(set(all_values[all_values != '']))
        print(f"Found {len(unique_values)} unique values in {col}: {unique_values}")
        
        for value in unique_values:
            col_name = f"{prefix}_{value}".replace(' ', '_').replace('-', '_').replace('/', '_')
            # Strip each token before testing membership, so 'a, b' matches both 'a' and 'b'
            encoded_data[col_name] = encoded_data[col].apply(
                lambda x: np.nan if pd.isna(x)
                          else (1.0 if value in [t.strip() for t in x.split(',')] else 0.0)
            )
            print(f"  Created binary column: {col_name}")

## 3.7 One-Hot Encoding
# Apply one-hot encoding with dummy_na=True (resulting columns are 0/1, no NaN)
if one_hot_encode_cols:
    print(f"One-hot encoding the following columns: {one_hot_encode_cols}")
    encoded_data = pd.get_dummies(encoded_data, columns=one_hot_encode_cols, 
                                  dummy_na=True, prefix_sep='_', dtype=int)

## 3.8 Ensure Numerical Columns are Properly Typed
for col in numerical_cols:
    encoded_data[col] = pd.to_numeric(encoded_data[col], errors='coerce')

## 3.9 Remove Original Multi-Value Columns
encoded_data = encoded_data.drop(columns=multi_value_cols, errors='ignore')

# Step 4: Save the Encoded Data
encoded_data.to_excel('encoded_female_farmers_data_no_text.xlsx', index=False)
print("\nEncoded data saved successfully to 'encoded_female_farmers_data_no_text.xlsx'")

# Display Summary
print("\nEncoding summary:")
print(f"- {len(numerical_cols)} numerical columns preserved (float with NaN)")
print(f"- {len(binary_cols)} binary columns encoded (float: 1.0, 0.0, NaN)")
print(f"- {len(ordinal_cols)} ordinal columns encoded (float: 0.0 to 3.0, NaN)")
print(f"- {len(ordinal_categorical_cols)} categorical columns ordinally encoded (float with NaN)")
print(f"- {len(multi_value_cols)} multi-value columns converted to binary indicators (float: 1.0, 0.0, NaN)")
print(f"- {len(one_hot_encode_cols)} columns one-hot encoded (int: 0 or 1)")
print(f"- Final dataset shape: {encoded_data.shape}")

# Verification
if 'Profession du mari_0' not in encoded_data.columns:
    print("Success: 'Profession du mari_0' has been eliminated.")
else:
    print("Warning: 'Profession du mari_0' still exists.")

Standardized and replaced '0' with NaN in 'Profession du mari'
Tabagisme encoded as: {'non': 0.0, 'passif': 1.0, 'oui': 2.0}
Masque pour pesticides encoded with mapping: {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
Bottes encoded with mapping: {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
Gants encoded with mapping: {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
Casquette/Mdhalla encoded with mapping: {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
Manteau imperméable encoded with mapping: {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
Situation maritale encoded with mapping: {'célibataire': 0.0, 'mariée': 1.0, 'divorcée': 2.0, 'veuve': 3.0}
Domicile encoded with mapping: {'monastir': 0.0, 'sfax': 1.0, 'mahdia': 2.0}
Niveau socio-économique encoded with mapping: {'bas': 0.0, 'moyen': 1.0, 'bon': 2.0}
Statut encoded with mapping: {'permanente': 0.0, 'saisonnière': 1.0}
Niveau scolaire encoded wit

In [9]:
import pandas as pd
import numpy as np
import json
from datetime import datetime

def generate_codebook(encoded_data):
    """
    Generate a comprehensive codebook for the encoded dataset to facilitate later decoding.
    """
    # Initialize codebook
    codebook = {
        "dataset_info": {
            "name": "Female Farmers Health Dataset",
            "encoded_date": datetime.now().strftime("%Y-%m-%d"),
            "n_rows": encoded_data.shape[0],
            "n_columns": encoded_data.shape[1]
        },
        "column_types": {},
        "encoding_mappings": {},
        "binary_indicators": {},
        "one_hot_encodings": {},
        "missing_value_handling": {}
    }
    
    # Define column categories
    numerical_cols = [col for col in ['N°', 'Age', 'Nb enfants', 'Nb pers à charge', 'H travail / jour', 
                                      'Age ménopause', 'Ancienneté agricole', 'J travail / Sem', 
                                      'Poids', 'Taille', 'TAS', 'TAD', 'GAD'] 
                      if col in encoded_data.columns]
    
    binary_cols = [col for col in ['Neffa', 'Fumées de Tabouna', 'AT en milieu agricole', 'Ménopause'] 
                   if col in encoded_data.columns]
    
    ordinal_cols = [col for col in ['Masque pour pesticides', 'Bottes', 'Gants', 'Casquette/Mdhalla', 
                                    'Manteau imperméable', 'Catégorie professionnelle'] 
                    if col in encoded_data.columns]
    
    ordinal_categorical_cols = [col for col in ['Situation maritale', 'Domicile', 'Niveau socio-économique',
                                                'Statut', 'Niveau scolaire'] 
                                if col in encoded_data.columns]
    
    # Add mappings used for encoding (using floats where applicable)
    codebook["encoding_mappings"]["binary"] = {'oui': 1.0, 'non': 0.0}
    codebook["encoding_mappings"]["tabagisme"] = {'non': 0.0, 'passif': 1.0, 'oui': 2.0}
    codebook["encoding_mappings"]["ordinal_equipment"] = {'jamais': 0.0, 'parfois': 1.0, 'souvent': 2.0, 'toujours': 3.0}
    codebook["encoding_mappings"]["profession"] = {
        'agricultrice indépendante': 0.0, 
        'ouvrière': 1.0, 
        'ouvrière, agricultrice indépendante': 2.0,
        'pêcheur indépendante': 3.0
    }
    codebook["encoding_mappings"]["ordinal_categorical"] = {
        'Situation maritale': {'célibataire': 0.0, 'mariée': 1.0, 'divorcée': 2.0, 'veuve': 3.0},
        'Domicile': {'monastir': 0.0, 'sfax': 1.0, 'mahdia': 2.0},
        'Niveau socio-économique': {'bas': 0.0, 'moyen': 1.0, 'bon': 2.0},
        'Statut': {'permanente': 0.0, 'saisonnière': 1.0},
        'Niveau scolaire': {'analphabète': 0.0, 'primaire': 1.0, 'secondaire': 2.0, 'supérieur': 3.0}
    }
    
    # Add multi-value column mappings
    prefixes = {
        'Produits chimiques utilisés': 'Chemical',
        'Produits biologiques utilisés': 'Bio',
        'Engrais utilisés': 'Fertilizer',
        'Contraintes thermiques': 'Thermal',
        'Moyen de transport': 'Transport'
    }
    codebook["encoding_mappings"]["prefixes"] = prefixes
    
    # Record multi-value binary indicators
    codebook["binary_indicators"] = {}
    for prefix, original_col in [(v, k) for k, v in prefixes.items()]:
        indicator_cols = [col for col in encoded_data.columns if col.startswith(f"{prefix}_")]
        values = [col.replace(f"{prefix}_", "").replace("_", " ") for col in indicator_cols]
        if indicator_cols:
            codebook["binary_indicators"][original_col] = {
                "prefix": prefix,
                "indicators": dict(zip(indicator_cols, values))
            }
    
    # Record one-hot encodings
    one_hot_cols = {}
    for col in encoded_data.columns:
        if '_' in col and not any(col.startswith(prefix + '_') for prefix in prefixes.values()):
            prefix = col.split('_')[0]
            if prefix not in one_hot_cols:
                one_hot_cols[prefix] = []
            one_hot_cols[prefix].append(col)
    
    for prefix, cols in one_hot_cols.items():
        original_col = prefix
        # Filter out the NaN column for the mapping dictionary
        non_nan_cols = [col for col in cols if not col.endswith('_nan')]
        nan_cols = [col for col in cols if col.endswith('_nan')]
        
        values = [col.replace(f"{prefix}_", "") for col in non_nan_cols]
        
        codebook["one_hot_encodings"][original_col] = {
            "prefix": prefix,
            "columns": dict(zip(non_nan_cols, values)),
            "missing_indicator": nan_cols[0] if nan_cols else None
        }
    
    # Add column type information
    for col in encoded_data.columns:
        if col in numerical_cols:
            codebook["column_types"][col] = "numerical"
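        elif col in binary_cols:
            codebook["column_types"][col] = "binary"
        elif col == 'Tabagisme':
            # Three-level scale with its own mapping; without this branch it is labeled "other"
            codebook["column_types"][col] = "ordinal_tabagisme"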
        elif col in ordinal_cols:
            if col == 'Catégorie professionnelle':
                codebook["column_types"][col] = "ordinal_profession"
            else:
                codebook["column_types"][col] = "ordinal_equipment"
        elif col in ordinal_categorical_cols:
            codebook["column_types"][col] = "ordinal_categorical"
        elif any(col.startswith(prefix + '_') for prefix in prefixes.values()):
            codebook["column_types"][col] = "binary_indicator"
        elif '_' in col and col.split('_')[0] not in prefixes.values():
            # For one-hot encoding, correctly identify the nan column
            if col.endswith('_nan'):
                codebook["column_types"][col] = "one_hot_missing"
            else:
                codebook["column_types"][col] = "one_hot"
        else:
            codebook["column_types"][col] = "other"
    
    # Describe each column
    codebook["column_descriptions"] = {}
    for col in encoded_data.columns:
        col_type = codebook["column_types"].get(col, "unknown")
        description = {
            "type": col_type,
            "dtype": str(encoded_data[col].dtype),
            "missing_count": int(encoded_data[col].isna().sum()),
            "missing_percentage": round(encoded_data[col].isna().mean() * 100, 2)
        }
        
        if col_type in ["numerical", "binary", "ordinal_equipment", "ordinal_profession", "ordinal_categorical"]:
            if pd.api.types.is_numeric_dtype(encoded_data[col]):
                description["min"] = float(encoded_data[col].min()) if not pd.isna(encoded_data[col].min()) else None
                description["max"] = float(encoded_data[col].max()) if not pd.isna(encoded_data[col].max()) else None
                description["mean"] = float(encoded_data[col].mean()) if not pd.isna(encoded_data[col].mean()) else None
            if col_type in ["binary", "ordinal_equipment", "ordinal_profession", "ordinal_categorical"]:
                description["missing_value"] = "NaN indicates missing"
        
        if col_type == "binary_indicator":
            description["values"] = [0.0, 1.0]
            description["missing_value"] = "NaN if original value missing"
        elif col_type in ["one_hot", "one_hot_missing"]:
            description["values"] = [0, 1]
            if col_type == "one_hot_missing":
                description["missing_value"] = "1 indicates missing value"
            else:
                prefix = col.split('_')[0]
                nan_col = f"{prefix}_nan"
                if nan_col in encoded_data.columns:
                    description["missing_value"] = f"indicated by {nan_col} = 1"
        
        # For specific ordinal mappings, include the mapping
        if col_type == "ordinal_equipment":
            description["mapping"] = codebook["encoding_mappings"]["ordinal_equipment"]
        elif col_type == "ordinal_profession":
            description["mapping"] = codebook["encoding_mappings"]["profession"]
        elif col_type == "ordinal_categorical" and col in codebook["encoding_mappings"]["ordinal_categorical"]:
            description["mapping"] = codebook["encoding_mappings"]["ordinal_categorical"][col]
        elif col == 'Tabagisme':
            # Tabagisme has its own three-level mapping, so record it with the column
            description["mapping"] = codebook["encoding_mappings"]["tabagisme"]
            
        codebook["column_descriptions"][col] = description
    
    # Add missing value handling instructions
    codebook["missing_value_handling"] = {
        "binary": "NaN indicates missing",
        "ordinal_equipment": "NaN indicates missing",
        "ordinal_profession": "NaN indicates missing",
        "ordinal_categorical": "NaN indicates missing",
        "binary_indicator": "if all indicators for a prefix are NaN, original value was missing",
        "one_hot": "indicated by the _nan column for each group",
        "numerical": "NaN indicates missing"
    }
    
    return codebook

# Load the encoded data
encoded_data = pd.read_excel('encoded_female_farmers_data_no_text.xlsx')

# Generate the codebook
codebook = generate_codebook(encoded_data)

# Save the codebook to JSON file for later use
with open('female_farmers_codebook.json', 'w', encoding='utf-8') as f:
    json.dump(codebook, f, ensure_ascii=False, indent=2)

print("Codebook generated and saved to 'female_farmers_codebook.json'")

# Create a simplified text version for quick reference
with open('female_farmers_codebook_simple.txt', 'w', encoding='utf-8') as f:
    f.write("FEMALE FARMERS DATASET CODEBOOK\n")
    f.write("===============================\n\n")
    f.write(f"Created on: {codebook['dataset_info']['encoded_date']}\n")
    f.write(f"Dataset dimensions: {codebook['dataset_info']['n_rows']} rows × {codebook['dataset_info']['n_columns']} columns\n\n")
    
    f.write("VARIABLE ENCODINGS\n")
    f.write("-----------------\n\n")
    
    f.write("1. Binary Variables (float: 1.0='oui', 0.0='non', NaN=missing):\n")
    for col in [c for c, t in codebook["column_types"].items() if t == "binary"]:
        f.write(f"   - {col}\n")
    f.write("\n")
    
    f.write("2. Tabagisme (special encoding, float: 0.0='non', 1.0='passif', 2.0='oui', NaN=missing):\n")
    f.write("   - Tabagisme\n\n")
    
    f.write("3. Equipment Usage (ordinal, float: 0.0='jamais', 1.0='parfois', 2.0='souvent', 3.0='toujours', NaN=missing):\n")
    for col in [c for c, t in codebook["column_types"].items() if t == "ordinal_equipment"]:
        f.write(f"   - {col}\n")
    f.write("\n")
    
    f.write("4. Profession (ordinal, float, NaN=missing):\n")
    if 'Catégorie professionnelle' in encoded_data.columns:
        f.write("   - Catégorie professionnelle:\n")
        for val, code in codebook["encoding_mappings"]["profession"].items():
            f.write(f"     - {val} → {code}\n")
    f.write("\n")
    
    f.write("5. Categorical Variables (ordinal encoding, float, NaN=missing):\n")
    for category, mapping in codebook["encoding_mappings"]["ordinal_categorical"].items():
        if category in encoded_data.columns:
            f.write(f"   - {category}:\n")
            for val, code in mapping.items():
                f.write(f"     - {val} → {code}\n")
    f.write("\n")
    
    f.write("6. Multi-value Indicators (float: 1.0=present, 0.0=absent, NaN=original missing):\n")
    for orig_col, details in codebook["binary_indicators"].items():
        f.write(f"   - {orig_col} (prefix: {details['prefix']}):\n")
        for indicator_col, value in details["indicators"].items():
            f.write(f"     - {indicator_col} (1.0 = contains '{value}')\n")
    f.write("\n")
    
    f.write("7. One-hot Encoded Variables (int: 0 or 1):\n")
    for orig_col, details in codebook["one_hot_encodings"].items():
        f.write(f"   - {orig_col}:\n")
        for col, value in details["columns"].items():
            f.write(f"     - {col} (1 = '{value}')\n")
        # Handle the missing indicator properly
        missing_indicator = details.get("missing_indicator")
        if missing_indicator:
            f.write(f"     - {missing_indicator} (1 = missing value)\n")
    f.write("\n")
    
    f.write("MISSING VALUE HANDLING\n")
    f.write("---------------------\n\n")
    f.write("- For binary, ordinal, and numerical columns: NaN indicates missing values\n")
    f.write("- For multi-value indicators: if all indicators for a group are NaN, the original value was missing\n")
    f.write("- For one-hot encoded variables: missingness is indicated by the _nan column being 1\n")
    f.write("\n")
    
    f.write("DECODING INSTRUCTIONS\n")
    f.write("--------------------\n\n")
    f.write("To decode after imputation:\n\n")
    f.write("1. For binary variables: map 1.0→'oui', 0.0→'non', NaN→missing\n")
    f.write("2. For Tabagisme: map 0.0→'non', 1.0→'passif', 2.0→'oui', NaN→missing\n")
    f.write("3. For ordinal equipment: map 0.0→'jamais', 1.0→'parfois', 2.0→'souvent', 3.0→'toujours', NaN→missing\n")
    f.write("4. For profession: map reverse of profession mapping, NaN→missing\n")
    f.write("5. For ordinal categorical variables: use the reverse of the mappings shown above, NaN→missing\n")
    f.write("6. For one-hot encoded variables: column with value 1 indicates category, if *_nan=1 then missing\n")
    f.write("7. For multi-value indicators: collect values where indicator=1.0, if all NaN then original was missing\n")

print("\nSimplified codebook also saved to 'female_farmers_codebook_simple.txt'")

Codebook generated and saved to 'female_farmers_codebook.json'

Simplified codebook also saved to 'female_farmers_codebook_simple.txt'
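
The decoding instructions written to the codebook assume that each value is an exact code. Depending on the imputation method, imputed values may be fractional (for example `1.7`), so a decoding step may need to snap them to the nearest valid code first. The sketch below illustrates that idea under this assumption. `decode_ordinal` is a hypothetical helper and the imputed values are made up.

```python
import numpy as np
import pandas as pd

ordinal_map = {"jamais": 0.0, "parfois": 1.0, "souvent": 2.0, "toujours": 3.0}

def decode_ordinal(series, mapping):
    """Snap floats to the nearest valid code, then map codes back to labels."""
    codes = np.array(sorted(mapping.values()))
    reverse = {code: label for label, code in mapping.items()}

    def nearest_label(x):
        if pd.isna(x):
            return np.nan  # still missing, nothing to decode
        nearest = float(codes[np.abs(codes - x).argmin()])
        return reverse[nearest]

    return series.apply(nearest_label)

# Hypothetical imputed values for an equipment-usage column
imputed = pd.Series([0.2, 1.7, 3.4, np.nan])
decoded = decode_ordinal(imputed, ordinal_map)

print(decoded.tolist())
```

Snapping to the nearest code also clips out-of-range values such as `3.4` to the closest valid label.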
