# Overview
20211207

sarahfong

Managing projects, crafting config files, using config files to build pipelines

# General principles
Lifted from: 

Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLOS Computational Biology 5(7): e1000424. https://doi.org/10.1371/journal.pcbi.1000424

1. Record every operation.
2. Comment. Comment. Comment!
3. Automate everything you can. (Avoid editing by hand).
4. Use a config file to store file and directory names.
5. Use a driver script to control auxiliary scripts.
6. Separate experiment scripts from summary/analysis scripts.
7. Use relative paths, not absolute paths.
8. "Restart-ability"


Pro-tip: USE CAPS FOR CONSTANTS

# Project directory 

Project directory should be organized this way:

    ./project/
    
    ./project/src/  # programs
    ./project/data/  # datasets 
    ./project/bin/  # scripts
    ./project/results/  # analyses
    ./project/manuscripts/  # all papers written from this work. 
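
A minimal sketch of how this layout could be created with the standard library. The names `SUBDIRS` and `make_project_dirs` are illustrative and are not part of Noble's guide; the demo runs in a temporary directory so nothing real is touched.

```python
import os
import tempfile

# Subdirectories from the layout above
SUBDIRS = ("src", "data", "bin", "results", "manuscripts")


def make_project_dirs(project_root):
    """Create the project root and its standard subdirectories."""
    created = []
    for name in SUBDIRS:
        path = os.path.join(project_root, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    paths = make_project_dirs(os.path.join(tmp, "project"))
    all_exist = all(os.path.isdir(p) for p in paths)
```

Because of `exist_ok=True`, the function is safe to re-run on an existing project.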

# Pipelines

## Driver scripts
Driver script: a script that links many auxiliary scripts together into a pipeline.

Here, I'll make a pipeline to do GWAS Catalog LD expansion.

I have written separate scripts that:
1. Select GWAS variants w/ p-value < 5e-8
2. LD-expand significant GWAS variants

Broadly, I want my driver script to:

- Take input: config.ini file

- Run scripts using config: clean_gwas.py, LD_expand_GWAS.py

- Output: LD-expanded significant GWAS variants
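
A driver along these lines could be sketched as follows. This is only an illustration: the `[BIN]` section is stood in for by an inline string, and it assumes each auxiliary script takes the config path as its only argument, which the real scripts may not do. `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import configparser
import subprocess
import sys

# Stand-in for config.ini; the paths here are illustrative.
EXAMPLE_CONFIG = """
[BIN]
CLEAN_GWAS = bin/clean_gwas.py
LD_EXPAND = bin/LD_expand_GWAS.py
"""


def build_commands(config, configfile_name):
    """Return one command per pipeline step, in run order.

    Assumption: each auxiliary script takes the config path as its
    only argument. The real scripts may expect something else.
    """
    steps = ("CLEAN_GWAS", "LD_EXPAND")
    return [[sys.executable, config["BIN"][step], configfile_name] for step in steps]


def run_pipeline(commands):
    """Run each step; check=True stops the pipeline on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)
commands = build_commands(config, "config.ini")
# run_pipeline(commands)  # uncomment once the scripts exist
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before anything runs.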


# Config.ini file

## config.ini overview 

Purpose of a config file: store data file names, paths, and parameters in one place.

### Components of .ini files

###### a section is like a python dictionary: it has keys and values 

    [Section]  # section names are case sensitive
    Key=Value  # keys are not case sensitive
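
A small check of the case rules, using `configparser` (introduced later in this tutorial) on an inline string. The section and key names are just the placeholders from the snippet above.

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[Section]
Key = Value
""")

has_exact_section = config.has_section("Section")
has_lower_section = config.has_section("section")  # section names are case sensitive
stored_keys = list(config["Section"].keys())       # keys are stored lowercase
same_value = config["Section"]["KEY"] == config["Section"]["key"]
value = config["Section"]["key"]                   # the value keeps its case
```

One consequence: keys written in CAPS end up lowercase in the saved .ini file, but can still be looked up in CAPS.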

#### .ini features

- text document; every value is stored as a str

- key=value | key:value

- key w spaces=allowed

- key=value w spaces allowed

- key wo value allowed (only when the parser is created with allow_no_value=True)

- key=multiline
    value is allowed
    as long as indented further than value's first line 

### examples

#### section + str formatting
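
A possible example, adapted from the sample file in the configparser documentation. It shows both delimiters (`=` and `:`) and spaces in keys and values.

```python
import configparser

STR_EXAMPLE = """
[Simple Values]
key=value
spaces in keys=allowed
spaces in values=allowed as well
spaces around the delimiter = obviously
you can also use : to delimit keys from values
"""

config = configparser.ConfigParser()
config.read_string(STR_EXAMPLE)

section = config["Simple Values"]
plain = section["key"]
spaced_key = section["spaces in keys"]
spaced_value = section["spaces in values"]
colon_value = section["you can also use"]  # ":" also splits key from value
```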

#### int and float formatting
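
Numbers are written in the file as plain text and only become `int` or `float` when you ask for them. A short sketch; `N_SNPS` is a made-up key for illustration, while `SIG_PVAL` mirrors the parameter used later in this tutorial.

```python
import configparser

NUM_EXAMPLE = """
[PARAMS]
N_SNPS = 100
SIG_PVAL = 5e-8
"""

config = configparser.ConfigParser()
config.read_string(NUM_EXAMPLE)

raw = config["PARAMS"]["SIG_PVAL"]             # still a str
n_snps = config.getint("PARAMS", "N_SNPS")     # converted to int
pval = config.getfloat("PARAMS", "SIG_PVAL")   # converted to float
```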

#### multiline values
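
A sketch of a multiline value, with placeholder text. Continuation lines must be indented deeper than the line holding the key; the parser joins them with newlines.

```python
import configparser

MULTILINE_EXAMPLE = """
[Multiline Values]
description = first line
    second line
    third line
"""

config = configparser.ConfigParser()
config.read_string(MULTILINE_EXAMPLE)

description = config["Multiline Values"]["description"]
lines = description.split("\n")  # one entry per line of the value
```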

#### empty values
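
A sketch of the two kinds of "empty" entries, adapted from the configparser documentation. A key followed by a delimiter and nothing else gives an empty string; a bare key with no delimiter is only accepted when the parser is created with `allow_no_value=True`.

```python
import configparser

EMPTY_EXAMPLE = """
[No Values]
key_without_value
empty string value here =
"""

# Permissive parser: bare keys are accepted and read as None
config = configparser.ConfigParser(allow_no_value=True)
config.read_string(EMPTY_EXAMPLE)
no_value = config["No Values"]["key_without_value"]
empty = config["No Values"]["empty string value here"]

# Default parser: the bare key is a parsing error
strict = configparser.ConfigParser()
try:
    strict.read_string(EMPTY_EXAMPLE)
    bare_key_rejected = False
except configparser.ParsingError:
    bare_key_rejected = True
```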

####  comments

    [You can use comments]
    # like this
    ; or this
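
One thing to watch: by default, configparser only treats full-line comments as comments. A `#` after a value becomes part of the value unless the parser is created with `inline_comment_prefixes`. A short sketch (the `key = value` line is added here for illustration):

```python
import configparser

COMMENT_EXAMPLE = """
[You can use comments]
# like this
; or this
key = value  # not a comment by default
"""

# Default parser: the trailing text stays in the value
default = configparser.ConfigParser()
default.read_string(COMMENT_EXAMPLE)
default_value = default["You can use comments"]["key"]

# Parser with inline comments enabled: the trailing text is dropped
inline = configparser.ConfigParser(inline_comment_prefixes=("#",))
inline.read_string(COMMENT_EXAMPLE)
inline_value = inline["You can use comments"]["key"]
```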

#### indentations
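
A sketch of how indentation is handled, with the first sample adapted from the configparser documentation. Keys under a section may be indented for readability, but a line indented deeper than the previous key is read as a continuation of that key's value, not as a new key. The `[S]` section is a made-up minimal case.

```python
import configparser

INDENT_EXAMPLE = """
[Sections Can Be Indented]
    can_values_be_as_well = True
    purpose = formatting for readability
"""

config = configparser.ConfigParser()
config.read_string(INDENT_EXAMPLE)
keys = list(config["Sections Can Be Indented"].keys())

# Pitfall: "b = 2" is indented deeper than "a", so it joins a's value
PITFALL_EXAMPLE = """
[S]
a = 1
    b = 2
"""

pitfall = configparser.ConfigParser()
pitfall.read_string(PITFALL_EXAMPLE)
a_value = pitfall["S"]["a"]
b_is_key = "b" in pitfall["S"]
```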

#### interpolations
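
Interpolation lets one value refer to another. A sketch of the two built-in styles, using made-up paths: the default `BasicInterpolation` uses `%(key)s` within a section, and `ExtendedInterpolation` uses `${section:key}` across sections.

```python
import configparser

BASIC_EXAMPLE = """
[PATH]
BASE = /home/user/project
DATA = %(BASE)s/data
"""

config = configparser.ConfigParser()  # BasicInterpolation is the default
config.read_string(BASIC_EXAMPLE)
data_path = config["PATH"]["DATA"]              # reference is filled in
raw_path = config.get("PATH", "DATA", raw=True)  # reference left as written

EXTENDED_EXAMPLE = """
[PATH]
DATA = /home/user/project/data

[FILE]
GWAS_CAT = ${PATH:DATA}/gwas.bed.gz
"""

ext = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
ext.read_string(EXTENDED_EXAMPLE)
gwas_cat = ext["FILE"]["GWAS_CAT"]
```

With interpolation, a base path can be stored once and reused by other entries, rather than being expanded into every value before writing.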

# Creating/using config.ini file 

## configparser

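Python standard-library module to read and write config files. It ships with Python 3, so the install commands below are only needed if you want the backport package.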

https://docs.python.org/3/library/configparser.html

    pip install configparser
    
    conda install -c anaconda configparser 
    


## Example driver script to write config.ini

In [1]:
import os
import configparser

In [2]:
BASE_PATH = os.getcwd()
configfile_name = os.path.join(BASE_PATH, "config_gwas.ini") # name the file

### call configparser

In [3]:
# Add content to the file
config = configparser.ConfigParser()

### set parameters

In [4]:
# Parameters
POP = "EUR"  # LD panel in 1000G
DL_DATE = "2021-10-25"  # date GWAS catalog was last downloaded. 
SIG_PVAL = "5e-8" # significance threshold

### set paths

In [5]:
# Paths
BASE_PATH = os.path.dirname(os.getcwd())  # project root: parent of the current directory

BIN_PATH = os.path.join(BASE_PATH, "bin")  # where my scripts live
DATA_PATH = os.path.join(BASE_PATH, "data")  # where I dump new data.
RESULTS_PATH = os.path.join(BASE_PATH, "results")  # where I analyze results
SRC_PATH = os.path.join(BASE_PATH, "src")  # where any packages needed to run analyses live. 

### set bins

In [6]:
# Bins

# script to filter for GWAS variants w/ SIG_PVAL
CLEAN_GWAS = os.path.join(BIN_PATH, "clean_gwas.py")

# script to run David Rinker's script
LD_EXPAND = os.path.join(BIN_PATH, "LD_expand_GWAS.py")

# David Rinker's LD-expansion script. 
LD_ASSOC_SCRIPT = os.path.join(BIN_PATH, "assoc2LDSNPs_bed.py")

### writing sections of .ini file

#### Option1: create section w config.add_section(), config.set()

In [7]:
# step 1 - create section

# config.add_section(SECTION)

config.add_section("PATH")

In [8]:
# step 2 - add key, value to section

# config.set(SECTION, key, value)

config.set("PATH", "BIN", BIN_PATH)
config.set("PATH", "DATA", DATA_PATH)
config.set("PATH", "RESULTS", RESULTS_PATH)
config.set("PATH", "SRC", SRC_PATH)

#### Option2: create section as dictionary, piece-by-piece

In [9]:
# config[SECTION][key]=value

config.add_section("PARAMS")

config["PARAMS"]["POP"] = POP
config["PARAMS"]["DL_DATE"] = DL_DATE
config["PARAMS"]["SIG_PVAL"] = SIG_PVAL

#### Option3: create section as dictionary, one-fell-swoop

In [10]:
# config[SECTION] = {
#                    key1:val1,
#                    key2:val2,
#                    key3:val3
# }


config["BIN"] = {
    "CLEAN_GWAS":CLEAN_GWAS,
    "LD_EXPAND":LD_EXPAND,
    "LD_ASSOC_SCRIPT":LD_ASSOC_SCRIPT
}

### set, add files

In [11]:
config["FILE"] = {
        "GWAS_CAT": os.path.join(DATA_PATH, f"gwasCatalog_{DL_DATE}_hg38.bed.gz"),
        "GWAS_CAT_CLEAN":os.path.join(DATA_PATH,f"gwasCatalog_{DL_DATE}_hg38_cleaned_p5e-8.txt"),
        "GWAS_CAT_CLEAN_BED":os.path.join(DATA_PATH,f"gwasCatalog_{DL_DATE}_hg38_cleaned_p5e-8.bed"),
        "GWAS_CLEAN_LD": os.path.join(DATA_PATH, f"gwasCatalog_{DL_DATE}_hg38_cleaned_p5e-8_LD.bed")
        }

### write config file

The if-statement prevents overwriting an existing config file.

In [12]:
if not os.path.isfile(configfile_name):

    with open(configfile_name, 'w') as configfile:

        config.write(configfile)  # write the config; the with-block closes the file

### reading config.ini file

In [13]:
import configparser
import os
import sys

In [14]:
DEV = True

if DEV:

    BASE_PATH = os.getcwd()
    configfile_name = os.path.join(BASE_PATH, "config_gwas.ini")

else:

    # when running a script/pipeline from the command line,
    # pass the config file path as the first argument
    configfile_name = sys.argv[1]
    
config = configparser.ConfigParser()
config.read(configfile_name)

['/gpfs51/dors2/capra_lab/users/fongsl/resources/tutorials/using_config_ini/bin/config_gwas.ini']

In [15]:
del configfile_name

#### look at sections 

In [16]:
config.sections()

['PATH', 'PARAMS', 'BIN', 'FILE']

#### Retrieving values

##### get value from section w/ config.get(SECTION, key)

In [17]:
config.get("PATH", "DATA") 

'/gpfs51/dors2/capra_lab/users/fongsl/resources/tutorials/using_config_ini/data'

##### get value from section w/ dictionary-like command

In [18]:
config["PATH"]["DATA"]

'/gpfs51/dors2/capra_lab/users/fongsl/resources/tutorials/using_config_ini/data'

##### remember, values are imported as str!

In [19]:
type(config["PARAMS"]["SIG_PVAL"])

str

##### but you can import value as the datatype you want. 

In [20]:
type(config.getfloat("PARAMS", "SIG_PVAL"))

float

##### import value as something besides str.

In [21]:
# config.getfloat("SECTION", "key")  # returns value as float

# config.getboolean("SECTION", "key")  # returns value as bool

# config.getint("SECTION", "key")  # returns value as int
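
A runnable sketch of the typed getters, including the section-level form and the `fallback` argument for missing keys. `N_THREADS`, `OVERWRITE`, and `N_RETRIES` are made-up keys for illustration.

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[PARAMS]
SIG_PVAL = 5e-8
N_THREADS = 4
OVERWRITE = no
""")

params = config["PARAMS"]
pval = params.getfloat("SIG_PVAL")
threads = params.getint("N_THREADS")
overwrite = params.getboolean("OVERWRITE")        # "no" is read as False
retries = params.getint("N_RETRIES", fallback=3)  # key is absent, so fallback is used
```

`getboolean` accepts `yes/no`, `true/false`, `on/off`, and `1/0` (case-insensitive) and raises `ValueError` for anything else.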