# Parallelizing Exec Scripts in Kaiaulu

## 1. Introduction
This notebook demonstrates how to use the `parallelize.py` script to process multiple files in parallel using Kaiaulu exec scripts like `mailinglist.R` and `git.R`.

The `parallelize.py` script:

- Is designed to be generic and configurable via a YAML file.
- Uses placeholders in the command template to customize execution.
- Handles both modes: processing multiple input files individually and running a single command once.

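The configuration is read from YAML, so PyYAML must be installed. From a notebook cell, run `!pip install pyyaml` (omit the leading `!` in a regular shell).
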
## 2. Configuration File (config.yaml)
Create a config.yaml file to specify parameters for your exec script. Here's how you can set it up for different use cases.

### Example 1: Parsing .mbox Files with mailinglist.R

``` {yaml}
command_template:
  - "Rscript"
  - "{exec_script}"
  - "parse"
  - "{tools_yml}"
  - "{conf_yml}"
  - "{project_key}"
  - "{output_file}"
  - "{input_file}"
exec_script: path/to/mailinglist.R
tools_yml: path/to/tools.yml
conf_yml: path/to/helix.yml
project_key: project_key_1
input_dir: path/to/mbox_folder
output_dir: path/to/output_folder
input_file_extension: ".mbox"
num_threads: 4
process_individual_files: true
```

### Example 2: Tabulating Git Logs with git.R

``` {yaml}
command_template:
  - "Rscript"
  - "{exec_script}"
  - "tabulate"
  - "{tools_yml}"
  - "{conf_yml}"
  - "{output_file}"
exec_script: path/to/git.R
tools_yml: path/to/tools.yml
conf_yml: path/to/helix.yml
output_dir: path/to/output_folder
num_threads: 1
process_individual_files: false
```

### Running the Script

From the command line: 

```
python parallelize.py config.yaml
```

## 3. Explanation of parallelize.py

Command Template: The command_template in config.yaml specifies the command and its arguments, using placeholders that will be filled in at runtime.

Placeholders:

- {exec_script}: Path to the R exec script.
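- {command}: Specific command to run (e.g., parse). The example templates in this notebook write the command as a literal argument (`"parse"`, `"tabulate"`) instead of using this placeholder.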
- {tools_yml}, {conf_yml}, {project_key}: Configuration files and project key.
- {output_file}, {input_file}: Paths to the output and input files.

Process Individual Files: If process_individual_files is true, the script will process each file in input_dir matching input_file_extension in parallel.
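
As an illustration of how the placeholders might be filled in, here is a minimal sketch that applies `str.format` to each argument of the template. The helper name `fill_template` and the two file paths are hypothetical and chosen only for this example; the actual `parallelize.py` may substitute placeholders differently.

```python
def fill_template(template, values):
    """Return a new argument list with every {placeholder} replaced from values."""
    return [arg.format(**values) for arg in template]


# Same template as Example 1 in this notebook.
command_template = [
    "Rscript", "{exec_script}", "parse", "{tools_yml}", "{conf_yml}",
    "{project_key}", "{output_file}", "{input_file}",
]

values = {
    "exec_script": "path/to/mailinglist.R",
    "tools_yml": "path/to/tools.yml",
    "conf_yml": "path/to/helix.yml",
    "project_key": "project_key_1",
    # Hypothetical per-file values, assumed to be computed at runtime.
    "output_file": "path/to/output_folder/2023_01.csv",
    "input_file": "path/to/mbox_folder/2023_01.mbox",
}

command = fill_template(command_template, values)
print(command)
```

Arguments without braces, such as `"Rscript"` and `"parse"`, pass through unchanged, which is why the command name can be written literally in the template.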

Parallel Execution: Uses ThreadPoolExecutor to run tasks concurrently, based on num_threads.
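
The pattern can be sketched as follows, assuming each task is an argument list run through `subprocess.run`. The function names `run_command` and `run_all` are hypothetical, and the Python one-liners stand in for `Rscript` calls so the sketch runs without R installed.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one command to completion and return its exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode


def run_all(commands, num_threads):
    """Run the commands concurrently; exit codes come back in submission order."""
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        return list(executor.map(run_command, commands))


# Stand-in commands (assumption): in practice these would be filled-in Rscript calls.
commands = [[sys.executable, "-c", f"print({i})"] for i in range(4)]
exit_codes = run_all(commands, num_threads=4)
print(exit_codes)
```

Threads are sufficient here because the heavy work happens in the child processes; each worker thread mostly waits for its subprocess to finish.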

Output Files: Output files are saved in output_dir, with names derived from the input files.
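
One way to derive an output name from an input file is sketched below. The `.csv` extension and the helper name `derive_output_path` are assumptions made for illustration; the notebook does not state which naming rule `parallelize.py` applies.

```python
import os


def derive_output_path(input_file, output_dir, output_extension=".csv"):
    """Reuse the input file's base name, swapping its extension (assumed rule)."""
    stem = os.path.splitext(os.path.basename(input_file))[0]
    return os.path.join(output_dir, stem + output_extension)


output_path = derive_output_path(
    "path/to/mbox_folder/2023_01.mbox", "path/to/output_folder"
)
print(output_path)
```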

## 4. Adapting parallelize.py for Different Exec Scripts

1. Update command_template: Modify the command_template in config.yaml to match the arguments required by your exec script.

2. Adjust Placeholders: Make sure all necessary placeholders are included and correctly specified.

3. Set process_individual_files:
   - true: If your exec script processes individual files (e.g., multiple .mbox files).
   - false: If your exec script runs once without individual input files (e.g., processing a git repository).

4. Specify Input and Output Directories: Set input_dir, output_dir, and input_file_extension as needed.
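
When adapting a configuration, a quick check that every placeholder in the template has a matching key can catch typos early. The sketch below is not part of `parallelize.py`; the helper names are hypothetical, and it assumes `input_file` and `output_file` are supplied per task at runtime and not read from the configuration.

```python
from string import Formatter


def placeholders_in(template):
    """Collect the placeholder names used anywhere in the command template."""
    names = set()
    for arg in template:
        for _, field_name, _, _ in Formatter().parse(arg):
            if field_name:
                names.add(field_name)
    return names


def missing_placeholders(config, per_task=("input_file", "output_file")):
    """Return placeholder names that the config does not define."""
    needed = placeholders_in(config["command_template"]) - set(per_task)
    return sorted(name for name in needed if name not in config)


# Example 2 from this notebook, with conf_yml deliberately left out.
config = {
    "command_template": [
        "Rscript", "{exec_script}", "tabulate",
        "{tools_yml}", "{conf_yml}", "{output_file}",
    ],
    "exec_script": "path/to/git.R",
    "tools_yml": "path/to/tools.yml",
    "output_dir": "path/to/output_folder",
}

missing = missing_placeholders(config)
print(missing)
```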

## 5. Modifying parallelize.py to Create parse_mbox.py

In this section, we'll walk through how we adapted parallelize.py to create parse_mbox.py, specifically for parsing .mbox files using the mailinglist.R exec script.

### Step 1: Identify the Specific Requirements
- Process .mbox Files: We need to process multiple .mbox files in parallel.
- Use mailinglist.R: The R script expects certain arguments.
- Pass Relative Paths: The paths handed to the R script should be relative rather than absolute.

### Step 2: Start with the parallelize.py Template
The original parallelize.py script is designed to be generic, using a configuration file (config.yaml) to specify parameters.

Key Features of parallelize.py:

- Loads configuration from config.yaml.
- Uses a command_template with placeholders.
- Processes files in parallel using ThreadPoolExecutor.
- Handles both individual file processing and single command execution.
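
The choice between the two modes can be sketched as below, assuming the configuration keys shown in this notebook. The function name `plan_inputs` is hypothetical, and the real script may discover files differently.

```python
import os


def plan_inputs(config):
    """Return one entry per task: a file path per input, or [None] for a single run."""
    if not config.get("process_individual_files", False):
        return [None]  # single command, no per-file input
    input_dir = config["input_dir"]
    extension = config["input_file_extension"]
    return sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.endswith(extension)
    )


# Single-command mode needs no input directory at all.
print(plan_inputs({"process_individual_files": False}))
```

Each returned entry would then be turned into one filled-in command and handed to the thread pool.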

### Step 3: Modify the Configuration File
For parse_mbox.py, we need to adjust config.yaml to match our specific use case.

Updated config.yaml:
``` {yaml}
command_template:
  - "Rscript"
  - "{exec_script}"
  - "parse"
  - "{tools_yml}"
  - "{conf_yml}"
  - "{project_key}"
  - "{output_file}"
  - "{input_file}"
exec_script: "../../exec/mailinglist.R"
tools_yml: "../../tools.yml"
conf_yml: "../../conf/helix.yml"
project_key: "project_key_1"
input_dir: "../../../rawdata/helix/mod_mbox/save_mbox_mail"
output_dir: "../../../rawdata/helix/mod_mbox/parsed_mbox_mail"
input_file_extension: ".mbox"
num_threads: 4
process_individual_files: true
```
- command_template: Adjusted to include the specific command (parse) and arguments required by mailinglist.R.
- Paths: Set to match the directory structure of the project, using relative paths.
- process_individual_files: Set to true because we are processing multiple .mbox files.

### Step 4: Adjust the Python Script
We modified parallelize.py to create parse_mbox.py with the following changes:

- Removed Unnecessary Generalizations: Since parse_mbox.py is specific to parsing .mbox files, we can simplify the script.
- Paths: Adjusted the script to pass the correct paths to the R script.
- Customized the Command Execution: Edited the run_r_parse_mbox function to fit our use case.

Key Modifications:

- Load Configuration: Kept the configuration loading mechanism to read parameters from config.yaml.
- Compute Relative Paths: Used os.path.relpath() to pass relative paths to the R script.
- Process Individual Files: Ensured the script processes each .mbox file in the input directory.
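
The relative-path step can be sketched with `os.path.relpath`. The directory layout and the choice of start directory below are assumptions for illustration; `parse_mbox.py` may compute paths relative to a different location.

```python
import os


def to_relative(path, start):
    """Express path relative to the start directory."""
    return os.path.relpath(path, start=start)


# Hypothetical layout: a project root holding both the data and the scripts.
project_root = os.path.abspath("project")
mbox_file = os.path.join(project_root, "rawdata", "helix", "2023_01.mbox")
script_dir = os.path.join(project_root, "kaiaulu", "scripts")

relative_mbox = to_relative(mbox_file, script_dir)
print(relative_mbox)
```

Both arguments are plain path strings; `os.path.relpath` does not check that the files or directories exist.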

Now, run the script:

```
python parse_mbox.py
```