#### A pipeline for analysing Asteraceae genomes using the Asteraceae synteny-phylogenomic framework

## ACG (Asteraceae Comparative Genomics)
#### The objective of this pipeline is to use the AGB framework (Asteraceae Genome Blocks, defined in our paper[link]) to carry out comparative genomic studies in the largest angiosperm family, Asteraceae
1. Map Asteraceae genomes onto the 15*3 blocks.
2. Obtain the homologous gene groups, which contain syntenic genes, dispersed genes, and tandemly duplicated genes.
3. Using the syntenic genes as anchors, apply the graph-based syntenic block identification algorithm (drimm-synteny) to identify syntenic segments.
### Before starting the drimm-synteny analysis, we need to generate the synOG table using genespace_parser()
    Refer to genespace_parser.ipynb.
    Be careful about how the species name and gene name are joined ("@" or "_").
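
The choice of separator matters because a species label or gene name may itself contain an underscore. The following is a minimal sketch, assuming IDs have the form `<species><separator><gene>`; the helper `split_gene_id` and the example IDs are hypothetical and are not taken from genespace_parser.ipynb.

```python
def split_gene_id(gene_id, sep="@"):
    """Split a combined ID into (species, gene) at the first separator."""
    species, found, gene = gene_id.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in {gene_id!r}")
    return species, gene


# "@" is unambiguous as long as neither name contains it
print(split_gene_id("sp03@gene_0001"))

# "_" splits at the first underscore, which gives the wrong species
# if the (hypothetical) species label itself contains one, e.g. "sp_03"
print(split_gene_id("sp_03_gene0001", sep="_"))
```

With "_" as the separator, the second call returns "sp" as the species, so it is worth checking a few IDs by hand before building the synOG table.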
1. Use “processOrthofinder_tao.py” to process the genespace pangenome output into “drimm.sequence” (be careful about the copy_ratio).

2. Use “drimm-synteny” to build the drimm blocks. This currently only works on Windows; it can be run on Unix through Wine, but on Merian Wine cannot find the X11 library.

3. Use “processDrimm_tao.py” on the outputs of the previous step to obtain the final blocks for IAGS.

4. Use IAGS to construct ancestral genomes. I also tried mgra (after indexing and reformatting the blocks), but it did not work.

From here we can use either iags <option -ff> or grimm to calculate genome shuffling events:

5. ***a***) Use “standardize_drimmBlocks.py” to index and reformat the drimm genomes, including ancestral and extant genomes, then use grimm to characterise genome rearrangements; ***b***) use iags to calculate them directly.

6. Plot and interpret the results using iags (option -p). Before this, use “processGenenumber_tao.py” to generate the data needed by iags.

In [None]:
# Define the source paths for the files
riparianPath <- file.path("{genespaceDir}", "riparian")
pangenePath <- file.path("{genespaceDir}", "pangenes")

# List files with .pdf and _synOG.txt extensions
riparian_files <- list.files(riparianPath, pattern = "\\\\.pdf$", full.names = TRUE)
pangene_files <- list.files(pangenePath, pattern = "_synOG\\\\.txt$", full.names = TRUE)

# Define the destination folder for macrosynteny
macrosynteny <- file.path("{workDir}", "results")

# Create the destination directory if it doesn't exist
if (!dir.exists(macrosynteny)) {{
    dir.create(macrosynteny, recursive = TRUE)
}}

# Check if both expected file types exist and copy them
if (length(riparian_files) > 0 & length(pangene_files) > 0) {{
    file.copy(riparian_files, macrosynteny, overwrite = TRUE)
    file.copy(pangene_files, macrosynteny, overwrite = TRUE)
    print("Files copied successfully!")
}} else {{
    print("The riparian or pangene files don't exist, please check whether the Genespace run was successful...")
}}
    

In [None]:
# get dependencies ready
import os,shutil
import pandas as pd
import numpy as np
from pathlib import Path
import scripts.makeGenomeInfo as sGI
import scripts.processDrimm as spd
import scripts.processOG as spo
import scripts.modules as m
import scripts.configure as c

In [None]:
# initialize working environment
packagedir = "/home/feng041/scripts/ipynb/AGB_pipeline6/"
workdir = "/home/feng041/project/1_asteraceae_phylogenomics/test_AGBpipeline/"
drimmPath = '/home/feng041/scripts/ipynb/AGB_pipeline6/software/DRIMM-Synteny'

if not workdir.endswith("/"): workdir += "/"

info = c.configure(workdir,packagedir)
## info = (genespaceDir,resultsDir,meta,index,bedDir2,pepDir2)
## this checks the dependencies and the inputs, and creates the output folders
## it also generates an R script "run_genespace.R" under your workDir

In [None]:
## Running genespace takes some time, so please run it outside this pipeline, for example in your terminal with the correct conda environment:
conda activate asteraceae_phylogenomics
nohup R < run_genespace.R --no-save > nohup_genespace &
# this runs genespace in the background; if everything goes well, several files will be generated
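
Because genespace runs in the background, it can help to check for its outputs before continuing. The sketch below assumes the output layout used by the copy script earlier in this notebook (`riparian/*.pdf` and `pangenes/*_synOG.txt` under the genespace directory); the function names and the placeholder path are hypothetical.

```python
from pathlib import Path


def genespace_outputs(genespace_dir):
    """Return the riparian PDFs and *_synOG.txt tables found under genespace_dir."""
    base = Path(genespace_dir)
    riparian = sorted((base / "riparian").glob("*.pdf"))
    pangenes = sorted((base / "pangenes").glob("*_synOG.txt"))
    return riparian, pangenes


def genespace_finished(genespace_dir):
    """True only if both kinds of output file are present."""
    riparian, pangenes = genespace_outputs(genespace_dir)
    return bool(riparian) and bool(pangenes)


# Placeholder path: replace with your own genespace directory
print(genespace_finished("path/to/genespaceDir"))
```

This only checks that the files exist. It does not verify that the run finished without errors, so the `nohup_genespace` log should still be inspected.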

In [None]:
# run GENESPACE and get the results
## we recommend including the species and using ANC as an outgroup to get the best hierarchical OGs
## namely, prepare your genome data (pep and bed) as GENESPACE expects, put them in the corresponding folders, and run GENESPACE
##! currently, the chromosome name in AGB.bed is "a1", not "a01", while the module makeGenomeInfo.make_index uses "a01"; this needs to be adjusted
# get input data ready

#info = (genespaceDir,resultsDir,meta,index,bedDir2,pepDir2)
resultsDir = info[1]
genome_meta = info[2]
index_folder = info[3]
bed_folder = info[4]
pep_folder = info[5]

pangenome = resultsDir + "AGB_synOG.txt"

## let's first clean the pangenome table from Genespace;
clean_pangenome = m.pangenome_cleaner(pangenome)
## let's then parse the cleaned pangenome
orthogroups = m.parse_pangenome(pangenome,clean_pangenome)
# this is a tuple of the paths to the outputs: *_all, *_syn, and *_synteny; *_syn will be used for the pipeline

# Step 2. Get orthogroups ready
ortho = pd.read_csv(orthogroups[1], sep='\t', dtype = str, keep_default_na=False)
ortho2 = ortho[ortho["interpChr"] != ""]

In [None]:
# Step 1. First, let's prepare inputs for the pipeline
## make index file
df_index = sGI.make_index(index_folder) #your genome index and genome name you want to use

## make bed, genome_meta, chromosome_meta
genomeMeta = sGI.make_genome_meta(genome_meta) # to exclude a species, comment it out with '#'
#print(genomeMeta)

# DIC must be defined before make_chromosome_meta(), which takes it as an argument
DIC = {f"{row['sp']}@{row['chr']}": row['length'] for _, row in df_index.iterrows()}
chromosome_meta = sGI.make_chromosome_meta(bed_folder,DIC)
#print(chromosome_meta)

bed_dir = resultsDir + "drimmBED/"
if os.path.exists(bed_dir):
    print(bed_dir + " exists, let's generate the bed files in this folder")
os.makedirs(bed_dir, exist_ok=True)
sGI.make_drimmBED(bed_folder, DIC, bed_dir)
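
To make the structure of `DIC` concrete, here is a minimal sketch on a toy index table. The column names (`sp`, `chr`, `length`) are the ones read in the cell above; the rows are invented, and what `length` measures in the real index is not specified here.

```python
import pandas as pd

# Toy index table with invented values
df_index = pd.DataFrame(
    {
        "sp": ["AGB", "AGB", "sp03"],
        "chr": ["a1", "a2", "chr1"],
        "length": [1200, 950, 30000],
    }
)

# Same comprehension as in the notebook cell: key = "<species>@<chromosome>"
DIC = {f"{row['sp']}@{row['chr']}": row["length"] for _, row in df_index.iterrows()}

print(DIC)
```

Each key joins species and chromosome with "@", so chromosome names in the bed files must match the index exactly ("a1" versus "a01", for instance) for the lookup to succeed.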

In [None]:
### Step2 perform drimm-synteny to call synteny blocks
## process synOG table, call pairwise orthologs for given ref-target pair

# for pairwise comparison
reference = "AGB"
queries = genomeMeta['sp'].to_list()
genome_meta_dict = genomeMeta.set_index('sp')['chrN'].to_dict()
chromosome_meta_dict = {lst[0]: lst[1:] for lst in chromosome_meta}
for species in queries:
    if not species == reference:
        species_list = [reference, species]
        subdir = resultsDir + species + "/1_SynOG/" # 1. this is for two species; to compare multiple species, this needs to change
        info_list = m.prepare_meta(species_list, genomeMeta, chromosome_meta)
        sp,sp_ratio,sp_chr_number,long_chr_list,gff_list = info_list # this decides species pair
        #print(info_list)
        #print(sp)
        #print(sp_chr_number)
        # step 1, let's process cleaned pangene table to generate drimm.sequence for drimm-synteny
        if not Path(subdir).exists():
            os.makedirs(subdir)
            # for genespace output
            ortho3 = m.prepare_OG(ortho2,sp) # 2. here, the argument 'sp' decides between pairwise and multi-species comparison
            #print(ortho3)
            '''
            prepare_OG() (modules.py line90) deals with polyploid genomes; in ortho2 there can be one/many-to-many relationships,
            e.g. ref@Gene1   query@Gene1|query@Gene2
            '''
            group_dir = spo.get_group_dir(ortho3,sp) # 3. and here, the argument 'sp' decides between pairwise and multi-species comparison
            finalGroup = spo.get_final_group(group_dir,sp_ratio)
            spo.processSynOG(bed_dir,subdir,group_dir,finalGroup,gff_list,long_chr_list)
        else: print("SynOG for " + species + " exists, we will use this for Drimm-Synteny; If this is not correct, please delete 1_SynOG/ and rerun ...")
        
        # step 2, let's process drimm.sequence to generate raw drimm blocks
        print("\n=============================================================\nlet's process orthologs to generate raw blocks\n=============================================================\n")
        drimmSequence = subdir + "drimm.sequence"
        sp_ploidy = dict(zip(genomeMeta['sp'], genomeMeta['ploidy']))
        dustThreshold = sp_ploidy[species] + sp_ploidy[reference] + 1
        outPath = resultsDir + species + "/2_DrimmRaw/"
        
        if not Path(outPath).exists():
            os.makedirs(outPath)
            m.drimmSynteny(drimmSequence,drimmPath,outPath,dustThreshold)
            print("drimm-synteny done!")
        else:
            print("Drimm-Synteny outpath exists; checking whether blocks have been built ...")
            Block_File = outPath + "blocks.txt"
            if Path(Block_File).exists() and Path(Block_File).stat().st_size > 0:
                print("A block file has already been generated and will be used for the next step; if this is not correct, please delete the block file and rerun ...")
            else:
                m.drimmSynteny(drimmSequence,drimmPath,outPath,dustThreshold)
                print("drimm-synteny done!")

        
        # step 3. let's process raw drimm blocks to generate final blocks
        print("\n=============================================================\nlet's process raw drimm blocks ...\n=============================================================\n")
        spd.main(info_list,species,resultsDir)
        
        print("\n=============================================================\nlet's generate cleaned blocks ...\n=============================================================\n")
        ## let's first make a comBed file based on the normal bed; this special bed will be used by combed_parser() and grimmBlock_parser()
        comBed = m.make_comBed(bed_folder)
        m.synBlock(resultsDir, species, comBed)
        
        print("\n=============================================================\nlet's generate data for chromosome painting ...\n=============================================================\n")
        m.chroBlock(resultsDir, species, reference, chromosome_meta_dict, genome_meta_dict)
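
Two small pieces of logic in the loop above can be isolated for clarity: the dust threshold (query ploidy plus reference ploidy plus one) and the check that decides whether drimm-synteny needs to be rerun. The sketch below is an illustration only. The helper names are hypothetical, the ploidy values are invented, and it assumes the ploidy column holds integers (string values would fail on the `+ 1`).

```python
from pathlib import Path


def dust_threshold(sp_ploidy, species, reference):
    """Ploidy of the query plus ploidy of the reference plus one."""
    return sp_ploidy[species] + sp_ploidy[reference] + 1


def has_blocks(out_path, name="blocks.txt"):
    """True if the block file exists and is not empty."""
    block_file = Path(out_path) / name
    return block_file.exists() and block_file.stat().st_size > 0


# Invented ploidy values for illustration only
sp_ploidy = {"AGB": 1, "sp03": 2}
print(dust_threshold(sp_ploidy, "sp03", "AGB"))
```

Checking for a non-empty `blocks.txt` rather than just the output folder means an interrupted run, which leaves the folder but no blocks, is repeated instead of being silently reused.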

In [None]:
# Characterize genome rearrangements
# there are two options:
## 1. iags approach (fission, fusion): please follow the iags guide; the output from step 3 can be used directly
### 1.1 to do: wrap iags

## 2. Grimm approach (fission, fusion, inversion, translocation);
### 2.1 to use grimm, we need to reshape the data to fit grimm; this generates grimm-format genomes in <4_GrimmBlocks>;
### 2.2 be careful with the parameter <ratio>, which is meant to handle duplicated blocks and single blocks independently

'''
the drimm-format genomes (finalBlocks) from processDrimm use geneIDs that do not start from 1, which does not suit grimm and mgra
here, we re-code the geneIDs from 1 to x
'''
drimm_genome_path = "/Users/fengtao/Library/CloudStorage/OneDrive-WageningenUniversity&Research/project/Asteraceae_evolution/2_analysis/10_ancestralGenome/genespace26/refSp03v1"
#drimm_ancestor_genome_path = "/Users/fengtao/Library/Group Containers/G69SCX94XU.duck/Library/Application Support/duck/Volumes.noindex/myers/project/1_asteraceae_phylogenomics/4_ancestralGenome/iags"
#drimm_genomes_new = "/Users/fengtao/Library/Group Containers/G69SCX94XU.duck/Library/Application Support/duck/Volumes.noindex/myers/project/1_asteraceae_phylogenomics/4_ancestralGenome/extent_indexed.drimm"
#species_ids = ["sp03v1","sp03v5","sp15","sp07","sp03","sp41","sp13","sp11","sp29","sp30","sp22","sp21","sp14","sp10","sp08","sp35"]
# named species_ids rather than list, to avoid shadowing the Python builtin
species_ids = ["sp03v1","sp03","sp07","sp10","sp13","sp15","sp21","sp29","sp35","sp03v5","sp08","sp11","sp14","sp18","sp22","sp30","sp41"] # be careful: the reference genome needs to be in the list
ratio = 2 # the ratio of blocks between target and query genomes
genome_list = [genome + "_" + str(ratio) for genome in species_ids]
print(genome_list)

# gather the genomes we want to include; sometimes we only want to analyse some genomes, for example drimm considers ancestors and descendants, while mgra takes only extant genomes
if not drimm_genome_path.endswith("/"): drimm_genome_path += "/"
for file in os.listdir(drimm_genome_path):
    print(file)
    genome_path = []
    if file in species_ids:
        path = drimm_genome_path + file + "/3_DrimmBlocks/finalBlocks/"
        grimm_dir = drimm_genome_path + file + "/4_GrimmBlocks"
        os.makedirs(grimm_dir, exist_ok=True)
        grimmBlocks = grimm_dir + "/grimm_" + str(ratio) + ".block"
        for file2 in os.listdir(path):
            if file2.endswith("_" + str(ratio) + ".final.block"):
                genome_path.append(path + file2)
        print(genome_path)
        # standarlize_drimmblocks() must be defined or imported before this cell is run
        standarlize_drimmblocks(genome_list, genome_path, grimmBlocks)

### 2.3 go to grimm to calculate genome shuffling events
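
The re-coding of IDs "from 1 to x" described in the cell above can be sketched as follows. This is not the implementation of `standarlize_drimmblocks`, which is not shown in this notebook. It assumes each genome is a list of chromosomes and each chromosome a list of signed integer IDs, where the sign encodes orientation; the function name and the example data are hypothetical.

```python
def recode_block_ids(genomes):
    """Map arbitrary signed IDs onto 1..x, keeping orientation.

    genomes: dict of genome name -> list of chromosomes,
             each chromosome a list of signed integers.
    Returns the re-coded genomes and the old-to-new ID mapping.
    """
    mapping = {}
    recoded = {}
    for name, chromosomes in genomes.items():
        new_chromosomes = []
        for chromosome in chromosomes:
            new_chromosome = []
            for block in chromosome:
                key = abs(block)
                if key not in mapping:
                    # new IDs are assigned in order of first appearance
                    mapping[key] = len(mapping) + 1
                sign = -1 if block < 0 else 1
                new_chromosome.append(sign * mapping[key])
            new_chromosomes.append(new_chromosome)
        recoded[name] = new_chromosomes
    return recoded, mapping


# Invented example: two genomes sharing three of four IDs
genomes = {
    "sp03": [[105, -230], [412]],
    "sp07": [[230, 105], [-412, 999]],
}
recoded, mapping = recode_block_ids(genomes)
print(recoded)
print(mapping)
```

Using one shared mapping for all genomes keeps the same block comparable between species, which rearrangement tools need when they compare gene orders.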