# *A. halleri* Bayesian Causal Network Analysis 

This analysis uses the [causalnex](https://causalnex.readthedocs.io/) Python package. The code explained here has only been tested in a `PyTorch` Docker container image on a server with research-grade GPUs and >250 cores.

## Import Python Libraries

In [None]:
# import pandas and numpy
import pandas as pd
import numpy as np

# setup label encoder to transform non-numeric data into numeric data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# silence warnings
import warnings
warnings.filterwarnings("ignore")

# import StructureModel 
from causalnex.structure import StructureModel
sm = StructureModel()


## *A. halleri* Data Input and Formatting

### Read in phenotype and environment data

These data are not normalized or scaled in any way.

In [None]:
data = pd.read_csv('~/a_halleri/data/Ahalleri_transplant_exp_F.txt', delimiter='\t')
data.head()

Unnamed: 0,sample,origin_pop,origin_type,treatment_pop,treatment_type,code,comparison_type,comparison_pop,comparison_3levels,F_FW_rosette,F_FW_aboveground_tot,F_FW_root,F_DW_rosette,F_DW_aboveground_tot,F_DW_root,F_RGR_max,F_WC1,F_WC2,F_LMA,F_fvfm,F_PIabs,F_Zn_S,F_Cd_S,F_Pb_S,F_Mg_S,F_Ca_S,F_K_S,F_Na_S,F_Cu_S,F_Fe_S,F_P_S,F_Zn_R,F_Cd_R,F_Pb_R,F_Mg_R,F_Ca_R,F_K_R,F_Na_R,F_Cu_R,F_Fe_R,...,F_stem_nr,F_max_stem_length,F_epid_up,F_mes,F_epid_low,F_whole_leaf,F_cuti_up,F_cuti_low,F_stom_size_up_la,F_stom_size_up_sa,F_stom_area_up,F_stom_size_low_la,F_stom_size_low_sa,F_stom_area_low,F_stom_dens_low,F_stom_dens_up,F_palisade_long,F_palisade_short,F_trich_dens_up,F_trich_dens_low,F_whole_leaf_added,F_epid_up_propa,F_mes_propa,F_epid_low_propa,F_protection,F_assim_protection_prop,F_epid_up_propb,F_mes_propb,F_epid_low_propb,F_palisade_index,F_pollen_n_alive,F_pollen_n_dead,F_pollen_n_degen,F_pollen_n_deaddegen,F_pollen_n_all,F_pollen_viability_perc,F_pollen_dead_perc,F_pollen_degen_perc,F_pollen_deaddegen_perc,F_pollen_diameter
0,M_PL22_101,M_PL22,M,M_PL22,M,M_PL22_101_M_PL22,sympatric,sympatric,sympatric,2.551,2.551,1.125,0.3817,0.3817,0.173,0.023657,73.209549,2.733,0.004552,0.82,2.849,9031.0,321.0,314.4,6230.0,19647.0,16208.0,42.0,10.7,4233.012,0.46,4054.0,243.8,215.7,2023.0,6036.0,8298.0,38.0,9.3,1789.5455,...,20.0,31.333333,19.350001,156.485001,15.03,183.93,0.65,0.59,14.584167,13.699167,157.798842,15.5,13.8775,169.62928,332.633053,350.140056,39.814999,24.305001,0.0,0.0,190.865002,0.105203,0.850786,0.081716,34.380001,4.551629,0.101381,0.819873,0.078747,1.63814,327.0,2.0,0.0,2.0,329.0,99.39,0.61,0.0,0.607903,15.54
1,M_PL22_101,M_PL22,M,M_PL27,M,M_PL22_101_M_PL27,sympatric,allopatric,near_allopatric,19.641,31.112,1.769,2.496,5.285,0.3765,0.04514,84.513619,5.457,0.001853,0.776833,3.061,19253.0,296.0,23.0,6186.0,17412.0,16289.0,16.0,3.1,420.074,0.19,4642.0,120.9,70.9,2216.0,3847.0,14669.0,47.0,4.6,114.2325,...,17.333333,63.333333,24.54,125.355003,16.145001,155.93,0.8,0.535,16.166667,14.335833,182.459788,17.954167,14.723333,208.017874,175.070028,297.619048,30.505001,22.565001,0.0,0.0,166.040004,0.157378,0.803918,0.10354,40.685001,3.081111,0.147796,0.754969,0.097236,1.351872,432.0,7.0,1.0,8.0,440.0,98.18,1.59,0.23,1.818182,16.653333
2,M_PL22_101,M_PL22,M,NM_PL14,NM,M_PL22_101_NM_PL14,allopatric,allopatric,far_allopatric,7.409,7.409,0.541,0.747,0.747,0.167,0.02686,88.176796,7.458,0.00175,0.841,2.9883,12013.0,248.3,7.8,5807.0,21301.0,18591.0,148.0,8.9,3496.089,0.51,3316.0,139.9,2.1,1352.0,4300.0,15466.0,45.0,8.9,1426.211,...,28.333333,62.666667,21.194999,110.145,19.055001,144.290001,0.565,0.46,16.595,14.405,187.869031,15.965,13.64,171.136657,210.084034,420.168067,38.129999,31.639999,0.0,0.0,150.395,0.146892,0.763359,0.13206,40.25,2.736522,0.140929,0.732371,0.1267,1.20512,540.0,7.0,0.0,7.0,547.0,98.72,1.28,0.0,1.279707,18.22
3,M_PL22_101,M_PL22,M,NM_PL35,NM,M_PL22_101_NM_PL35,allopatric,allopatric,far_allopatric,4.232,8.113,3.653,0.8914,1.7967,0.8232,0.019424,78.797468,3.716,,0.814667,2.4203,1891.0,10.9,1.6,1481.0,3511.0,13426.0,33.0,2.9,1184.573,0.26,1720.0,6.8,1.2,921.0,3030.0,9392.0,34.0,4.8,422.256,...,17.333333,24.666667,19.715001,78.86,15.875,103.73,0.6,0.49,16.58,13.993333,183.050557,17.576666,15.289167,211.835325,297.619048,262.605042,33.049999,22.875,0.0,0.0,114.450001,0.190061,0.760243,0.153042,35.590001,2.215791,0.172259,0.689035,0.138707,1.444809,866.0,7.0,3.0,10.0,876.0,98.86,0.8,0.34,1.141553,17.278333
4,M_PL22_102,M_PL22,M,M_PL22,M,M_PL22_102_M_PL22,sympatric,sympatric,sympatric,,,,,,,0.016111,,,,0.7915,2.2155,,,,,,,,,,,,,,,,,,,,...,8.0,21.333333,,,,,,,,,,,,,,,,,,,,,,,,,,,,,687.0,3.0,2.0,5.0,692.0,99.28,0.43,0.29,0.722543,16.65


### Drop features from Model

1. Drop `code` from the model; it is an identifier internal to the researchers who collected the data.
2. Drop `comparison_type` and `comparison_pop`. Once `comparison_3levels` is one-hot encoded, the resulting features already carry the information in both dropped columns.
3. Continue with the reduced `data` frame when encoding categorical variables.

In [None]:
drop_col = ['code', 'comparison_type', 'comparison_pop']
data = data.drop(columns=drop_col)
data.head(5)

Unnamed: 0,sample,origin_pop,origin_type,treatment_pop,treatment_type,comparison_3levels,F_FW_rosette,F_FW_aboveground_tot,F_FW_root,F_DW_rosette,F_DW_aboveground_tot,F_DW_root,F_RGR_max,F_WC1,F_WC2,F_LMA,F_fvfm,F_PIabs,F_Zn_S,F_Cd_S,F_Pb_S,F_Mg_S,F_Ca_S,F_K_S,F_Na_S,F_Cu_S,F_Fe_S,F_P_S,F_Zn_R,F_Cd_R,F_Pb_R,F_Mg_R,F_Ca_R,F_K_R,F_Na_R,F_Cu_R,F_Fe_R,F_P_R,F_rosette_hight,F_stem_nr,F_max_stem_length,F_epid_up,F_mes,F_epid_low,F_whole_leaf,F_cuti_up,F_cuti_low,F_stom_size_up_la,F_stom_size_up_sa,F_stom_area_up,F_stom_size_low_la,F_stom_size_low_sa,F_stom_area_low,F_stom_dens_low,F_stom_dens_up,F_palisade_long,F_palisade_short,F_trich_dens_up,F_trich_dens_low,F_whole_leaf_added,F_epid_up_propa,F_mes_propa,F_epid_low_propa,F_protection,F_assim_protection_prop,F_epid_up_propb,F_mes_propb,F_epid_low_propb,F_palisade_index,F_pollen_n_alive,F_pollen_n_dead,F_pollen_n_degen,F_pollen_n_deaddegen,F_pollen_n_all,F_pollen_viability_perc,F_pollen_dead_perc,F_pollen_degen_perc,F_pollen_deaddegen_perc,F_pollen_diameter
0,M_PL22_101,M_PL22,M,M_PL22,M,sympatric,2.551,2.551,1.125,0.3817,0.3817,0.173,0.023657,73.209549,2.733,0.004552,0.82,2.849,9031.0,321.0,314.4,6230.0,19647.0,16208.0,42.0,10.7,4233.012,0.46,4054.0,243.8,215.7,2023.0,6036.0,8298.0,38.0,9.3,1789.5455,0.32,1.0,20.0,31.333333,19.350001,156.485001,15.03,183.93,0.65,0.59,14.584167,13.699167,157.798842,15.5,13.8775,169.62928,332.633053,350.140056,39.814999,24.305001,0.0,0.0,190.865002,0.105203,0.850786,0.081716,34.380001,4.551629,0.101381,0.819873,0.078747,1.63814,327.0,2.0,0.0,2.0,329.0,99.39,0.61,0.0,0.607903,15.54
1,M_PL22_101,M_PL22,M,M_PL27,M,near_allopatric,19.641,31.112,1.769,2.496,5.285,0.3765,0.04514,84.513619,5.457,0.001853,0.776833,3.061,19253.0,296.0,23.0,6186.0,17412.0,16289.0,16.0,3.1,420.074,0.19,4642.0,120.9,70.9,2216.0,3847.0,14669.0,47.0,4.6,114.2325,0.14,1.666667,17.333333,63.333333,24.54,125.355003,16.145001,155.93,0.8,0.535,16.166667,14.335833,182.459788,17.954167,14.723333,208.017874,175.070028,297.619048,30.505001,22.565001,0.0,0.0,166.040004,0.157378,0.803918,0.10354,40.685001,3.081111,0.147796,0.754969,0.097236,1.351872,432.0,7.0,1.0,8.0,440.0,98.18,1.59,0.23,1.818182,16.653333
2,M_PL22_101,M_PL22,M,NM_PL14,NM,far_allopatric,7.409,7.409,0.541,0.747,0.747,0.167,0.02686,88.176796,7.458,0.00175,0.841,2.9883,12013.0,248.3,7.8,5807.0,21301.0,18591.0,148.0,8.9,3496.089,0.51,3316.0,139.9,2.1,1352.0,4300.0,15466.0,45.0,8.9,1426.211,0.69,0.0,28.333333,62.666667,21.194999,110.145,19.055001,144.290001,0.565,0.46,16.595,14.405,187.869031,15.965,13.64,171.136657,210.084034,420.168067,38.129999,31.639999,0.0,0.0,150.395,0.146892,0.763359,0.13206,40.25,2.736522,0.140929,0.732371,0.1267,1.20512,540.0,7.0,0.0,7.0,547.0,98.72,1.28,0.0,1.279707,18.22
3,M_PL22_101,M_PL22,M,NM_PL35,NM,far_allopatric,4.232,8.113,3.653,0.8914,1.7967,0.8232,0.019424,78.797468,3.716,,0.814667,2.4203,1891.0,10.9,1.6,1481.0,3511.0,13426.0,33.0,2.9,1184.573,0.26,1720.0,6.8,1.2,921.0,3030.0,9392.0,34.0,4.8,422.256,0.21,1.0,17.333333,24.666667,19.715001,78.86,15.875,103.73,0.6,0.49,16.58,13.993333,183.050557,17.576666,15.289167,211.835325,297.619048,262.605042,33.049999,22.875,0.0,0.0,114.450001,0.190061,0.760243,0.153042,35.590001,2.215791,0.172259,0.689035,0.138707,1.444809,866.0,7.0,3.0,10.0,876.0,98.86,0.8,0.34,1.141553,17.278333
4,M_PL22_102,M_PL22,M,M_PL22,M,sympatric,,,,,,,0.016111,,,,0.7915,2.2155,,,,,,,,,,,,,,,,,,,,,1.333333,8.0,21.333333,,,,,,,,,,,,,,,,,,,,,,,,,,,,,687.0,3.0,2.0,5.0,692.0,99.28,0.43,0.29,0.722543,16.65
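
The claim in step 2, that `comparison_3levels` already carries the information in `comparison_type` and `comparison_pop`, can be checked directly. The sketch below uses a small hand-built frame containing only the level combinations visible in the first `head()` output (the frame name `toy` is hypothetical); on the full data the same `groupby` would have to run before the two columns are dropped.

```python
import pandas as pd

# Toy rows copied from the first head() output; not the full data set.
toy = pd.DataFrame({
    "comparison_type":    ["sympatric", "sympatric", "allopatric", "allopatric"],
    "comparison_pop":     ["sympatric", "allopatric", "allopatric", "allopatric"],
    "comparison_3levels": ["sympatric", "near_allopatric", "far_allopatric", "far_allopatric"],
})

# Count distinct values of the two candidate columns within each level.
# If every count is 1, each level maps to exactly one (type, pop) pair,
# so the two columns add nothing beyond comparison_3levels.
n_unique = toy.groupby("comparison_3levels")[["comparison_type", "comparison_pop"]].nunique()
is_redundant = bool((n_unique == 1).all().all())

print(n_unique)
print("redundant:", is_redundant)
```

If any count were greater than 1 on the real data, dropping the two columns would lose information and step 2 would need revisiting.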


### Transform Categorical Variables to Numeric Ones

1. Copy the `pandas` dataframe `data` to `struct_data` for `NOTEARS` preprocessing.
2. Remove rows that contain `NA` values.
3. Rename `sample` to `genotype` so the column name does not conflict with the `pandas` method `DataFrame.sample()`.
4. Encode categoricals with three or more categories as dummy variables.
  * Those are: `genotype`, `origin_pop`, `treatment_pop`, and `comparison_3levels`
  * The output of this encoding is stored in the dataframe `dum_df`
5. Use `LabelEncoder` from `sklearn.preprocessing` to convert all two-category variables to binary.
  * `M` is encoded as `0` and `NM` is encoded as `1`

In [None]:
from sklearn.preprocessing import LabelEncoder
struct_data = data.copy()

# drop na rows 
struct_data = struct_data.dropna()
# change sample to genotype to not interfere with code by invoking sample()
struct_data = struct_data.rename(columns={"sample": "genotype"})

#encode non-binary categorical variables as dummy vars
dum_df = pd.get_dummies(struct_data, columns=['genotype', 'origin_pop',
                                                  'treatment_pop',
                                                  'comparison_3levels'])
# encode binary categorical variables as 0's or 1's
non_numeric_columns = list(dum_df.select_dtypes(exclude=[np.number]).columns)
le = LabelEncoder()
for col in non_numeric_columns:
  dum_df[col] = le.fit_transform(dum_df[col])

dum_df.head(5)

Unnamed: 0,origin_type,treatment_type,F_FW_rosette,F_FW_aboveground_tot,F_FW_root,F_DW_rosette,F_DW_aboveground_tot,F_DW_root,F_RGR_max,F_WC1,F_WC2,F_LMA,F_fvfm,F_PIabs,F_Zn_S,F_Cd_S,F_Pb_S,F_Mg_S,F_Ca_S,F_K_S,F_Na_S,F_Cu_S,F_Fe_S,F_P_S,F_Zn_R,F_Cd_R,F_Pb_R,F_Mg_R,F_Ca_R,F_K_R,F_Na_R,F_Cu_R,F_Fe_R,F_P_R,F_rosette_hight,F_stem_nr,F_max_stem_length,F_epid_up,F_mes,F_epid_low,...,genotype_M_PL27_102,genotype_M_PL27_103,genotype_M_PL27_104,genotype_M_PL27_105,genotype_M_PL27_106,genotype_M_PL27_107,genotype_M_PL27_108,genotype_M_PL27_109,genotype_M_PL27_110,genotype_NM_PL14_105,genotype_NM_PL14_111,genotype_NM_PL14_114,genotype_NM_PL14_118,genotype_NM_PL14_119,genotype_NM_PL14_121,genotype_NM_PL14_126,genotype_NM_PL14_127,genotype_NM_PL14_128,genotype_NM_PL14_130,genotype_NM_PL35_101,genotype_NM_PL35_103,genotype_NM_PL35_104,genotype_NM_PL35_105,genotype_NM_PL35_106,genotype_NM_PL35_107,genotype_NM_PL35_108,genotype_NM_PL35_109,genotype_NM_PL35_110,genotype_NM_PL35_111,origin_pop_M_PL22,origin_pop_M_PL27,origin_pop_NM_PL14,origin_pop_NM_PL35,treatment_pop_M_PL22,treatment_pop_M_PL27,treatment_pop_NM_PL14,treatment_pop_NM_PL35,comparison_3levels_far_allopatric,comparison_3levels_near_allopatric,comparison_3levels_sympatric
0,0,0,2.551,2.551,1.125,0.3817,0.3817,0.173,0.023657,73.209549,2.733,0.004552,0.82,2.849,9031.0,321.0,314.4,6230.0,19647.0,16208.0,42.0,10.7,4233.012,0.46,4054.0,243.8,215.7,2023.0,6036.0,8298.0,38.0,9.3,1789.5455,0.32,1.0,20.0,31.333333,19.350001,156.485001,15.03,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1
1,0,0,19.641,31.112,1.769,2.496,5.285,0.3765,0.04514,84.513619,5.457,0.001853,0.776833,3.061,19253.0,296.0,23.0,6186.0,17412.0,16289.0,16.0,3.1,420.074,0.19,4642.0,120.9,70.9,2216.0,3847.0,14669.0,47.0,4.6,114.2325,0.14,1.666667,17.333333,63.333333,24.54,125.355003,16.145001,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0
2,0,1,7.409,7.409,0.541,0.747,0.747,0.167,0.02686,88.176796,7.458,0.00175,0.841,2.9883,12013.0,248.3,7.8,5807.0,21301.0,18591.0,148.0,8.9,3496.089,0.51,3316.0,139.9,2.1,1352.0,4300.0,15466.0,45.0,8.9,1426.211,0.69,0.0,28.333333,62.666667,21.194999,110.145,19.055001,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0
6,0,1,3.618,11.743,0.628,0.406,2.148,0.189,0.0343,91.346154,10.556,0.004513,0.819667,2.4877,6832.0,139.2,3.0,6525.0,19273.0,21655.0,66.0,7.7,808.953,0.54,2467.0,56.6,3.4,1571.0,4742.0,14851.0,60.0,13.5,919.3795,0.6,0.0,21.333333,57.333333,15.255,79.32,12.855,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0
7,0,1,3.208,3.208,1.467,0.8748,0.8748,0.293,0.023917,82.474227,4.706,0.0058,0.767,1.4732,1385.0,22.7,1.9,1968.0,4097.0,13636.0,46.0,4.5,1521.641,0.22,1620.0,13.9,0.6,2314.0,4894.0,10974.0,82.0,6.4,169.53,0.18,1.833333,28.333333,28.0,26.17,147.114998,16.9,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0
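
To see the two encodings in isolation, the sketch below repeats the same pattern on a hypothetical three-row frame (`toy` and its values are made up for illustration). It shows why `M` becomes `0` and `NM` becomes `1`: `LabelEncoder` assigns integer codes to the sorted class labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with one binary and one three-level categorical column.
toy = pd.DataFrame({
    "origin_type": ["M", "NM", "M"],
    "comparison_3levels": ["sympatric", "near_allopatric", "far_allopatric"],
    "F_fvfm": [0.82, 0.78, 0.84],
})

# One-hot encode the multi-level column; origin_type is left as text for now.
toy_dum = pd.get_dummies(toy, columns=["comparison_3levels"])

# Label-encode whatever is still non-numeric. Depending on the pandas version
# the dummy columns may be boolean, in which case they are also picked up here
# and mapped to 0/1.
for col in toy_dum.select_dtypes(exclude=[np.number]).columns:
    toy_dum[col] = LabelEncoder().fit_transform(toy_dum[col])

print(toy_dum)
```

After the loop every column is numeric, which is the form passed to `from_pandas` in the next section.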


## Structure Learning with `NOTEARS`

The function `causalnex.structure.notears.from_pandas()` has the following arguments:

```text
X (DataFrame) – input data.
max_iter (int) – max number of dual ascent steps during optimisation.
h_tol (float) – exit if h(W) < h_tol (as opposed to strict definition of 0).
w_threshold (float) – fixed threshold for absolute edge weights.
tabu_edges (Optional[List[Tuple[str, str]]]) – list of edges(from, to) not to be included in the graph.
tabu_parent_nodes (Optional[List[str]]) – list of nodes banned from being a parent of any other nodes.
tabu_child_nodes (Optional[List[str]]) – list of nodes banned from being a child of any other nodes.
```
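
For context on `h_tol`: in the first `NOTEARS` reference cited in this section, h(W) is the acyclicity function h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when the weighted adjacency matrix W describes a DAG. The following is a minimal NumPy sketch of that definition, not CausalNex's internal implementation; the helper name `acyclicity` and the truncated power series used for the matrix exponential are assumptions made for illustration.

```python
import numpy as np

def acyclicity(W, n_terms=30):
    """Approximate h(W) = tr(exp(W * W)) - d with a truncated power series."""
    d = W.shape[0]
    M = W * W                      # element-wise square keeps entries non-negative
    term = np.eye(d)               # current series term M^k / k!, starting at k = 0
    total = np.eye(d)
    for k in range(1, n_terms):
        term = term @ M / k
        total = total + term
    return float(np.trace(total) - d)

# x0 -> x1 -> x2 is acyclic; adding x2 -> x0 closes a cycle.
W_dag = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, -0.7],
                  [0.0, 0.0, 0.0]])
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.9

h_dag = acyclicity(W_dag)
h_cyc = acyclicity(W_cyc)
print(h_dag, h_cyc)
```

The acyclic matrix gives a value of zero and the cyclic one a positive value. With `h_tol`, the optimiser exits once h(W) falls below the tolerance rather than requiring an exact zero.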

The `NOTEARS` algorithm has the following references:

```bibtex
@inproceedings{zheng2018dags,
    author = {Zheng, Xun and Aragam, Bryon and Ravikumar, Pradeep and Xing, Eric P.},
    booktitle = {Advances in Neural Information Processing Systems},
    title = {{DAGs with NO TEARS: Continuous Optimization for Structure Learning}},
    year = {2018}
}

@inproceedings{zheng2020learning,
    author = {Zheng, Xun and Dan, Chen and Aragam, Bryon and Ravikumar, Pradeep and Xing, Eric P.},
    booktitle = {International Conference on Artificial Intelligence and Statistics},
    title = {{Learning sparse nonparametric DAGs}},
    year = {2020}
}
```

### Using `NOTEARS`: Generate Directed Acyclic Graph (DAG) of *A. halleri* Data

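- `max_iter`: maximum number of dual ascent steps `NOTEARS` runs during optimisation.
- `w_threshold`: fixed threshold for absolute edge weights; edges with a smaller absolute weight are left out of the graph.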
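
To make the effect of `w_threshold` concrete, the sketch below applies a fixed cut-off to a small made-up weight matrix: entries whose absolute value is below the threshold are set to zero, and the remaining non-zero entries become edges. This illustrates the idea only and is not CausalNex's code; the matrix values are hypothetical and the node names are borrowed from the data purely as labels.

```python
import numpy as np

nodes = ["origin_type", "F_Zn_S", "F_Cd_S"]   # labels only, reused for illustration

# Hypothetical learned weights: W[i, j] is the weight of the edge nodes[i] -> nodes[j].
W = np.array([[0.0, 1.3, -0.9],
              [0.0, 0.0, 0.4],
              [0.05, 0.0, 0.0]])

w_threshold = 0.8

# Zero out every weight whose magnitude is below the threshold.
W_kept = np.where(np.abs(W) >= w_threshold, W, 0.0)

# List the surviving edges as (parent, child, weight).
edges = [(nodes[i], nodes[j], float(W_kept[i, j]))
         for i, j in zip(*np.nonzero(W_kept))]
print(edges)
```

Only the two edges with magnitude 1.3 and 0.9 survive here; the weights 0.4 and 0.05 are discarded. A higher threshold therefore yields a sparser graph.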


In [None]:
from causalnex.structure.notears import from_pandas
sm = from_pandas(X=dum_df, max_iter=1000, w_threshold=0.8)

### Visualize Graph Structure

Draw the DAG with the `plot_structure` function from CausalNex and save it as `sm_plot.png`.

In [None]:
from causalnex.plots import plot_structure
plot = plot_structure(sm)
plot.draw("sm_plot.png")

# Current Progress

As of Apr 06 2021, structure learning takes ~4 hours with 125 features and 250 cores on the CyVerse Developer Server. Graph generation then prints the following error message:

```text
Exception ignored in: <function AGraph.__del__ at 0x7f49dabff430>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pygraphviz/agraph.py", line 283, in __del__
  File "/opt/conda/lib/python3.8/site-packages/pygraphviz/agraph.py", line 1000, in _close_handle
TypeError: 'NoneType' object is not callable
```

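However, the graph is still written to `sm_plot.png`. The `Exception ignored in` prefix shows that the error was raised inside `AGraph.__del__`, that is, while the graph object was being cleaned up, so Python reports it without interrupting the run.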

# Potential Next Steps:

- [ ] Fit the Conditional Distribution of the Bayesian Network
  - Requires `train/test` split with `sklearn`
- [ ] Assess Model Quality
- [ ] Query the Marginals
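
The train/test split mentioned in the first item could look like the following sketch, which uses `train_test_split` from `sklearn.model_selection` on a stand-in frame. The 80/20 ratio, the `random_state`, and the frame `toy` are assumptions made for illustration; in the notebook the encoded data would presumably take its place.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the encoded data: ten rows, two numeric columns.
toy = pd.DataFrame({"F_Zn_S": range(10), "origin_type": [0, 1] * 5})

# Hold out 20% of the rows; random_state makes the split reproducible.
train, test = train_test_split(toy, test_size=0.2, random_state=0)
print(len(train), len(test))
```

The held-out rows would then be available for the model quality assessment listed as the second item.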
