# Notebook for re-annotation of Ash1

### One-time preprocessing
These steps produce the following processed files:
- `GRCh38_canonical_primary.fna`
- `GRCh38_primary_scaffolds.txt`
- `CHM13_canonical.fna`
- `Ash1_canonical_VDJ-masked.fna`
- `hybrid.fna`

In [None]:
%%bash
fastatools makecanonical -fi GRCh38.fna -mapfile GRCh38_mapfile.csv
fastatools makecanonical -fi CHM13.fna  -mapfile CHM13_mapfile.csv
fastatools makecanonical -fi Ash1.fna   -mapfile Ash1_mapfile.csv
fastatools keepprimary   -fi GRCh38_canonical.fna -listfo GRCh38_primary_scaffolds.txt
fastatools replacechrY   -f1 GRCh38_canonical_primary.fna -f2 CHM13_canonical.fna -fo hybrid.fna
bedtools   maskfasta     -fi Ash1_canonical.fna -bed VDJ_coords.bed -fo Ash1_canonical_VDJ-masked.fna
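
A quick sanity check after the preprocessing cell can catch a step that failed silently. The sketch below is not part of the original pipeline; `check_outputs` is a hypothetical helper that only tests that each named file exists and is non-empty.

```shell
# Hypothetical helper (not part of fastatools or bedtools):
# report every argument that is missing or empty, and return
# non-zero if at least one such file was found.
check_outputs() (
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  exit "$status"
)
```

For this section it could be called as `check_outputs GRCh38_canonical_primary.fna GRCh38_primary_scaffolds.txt CHM13_canonical.fna Ash1_canonical_VDJ-masked.fna hybrid.fna`.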

### Preprocess GFF files (run from the Refseq/CHESS directory)
These steps produce the following processed files:
- `GRCh38_canonical_primary_VDJ-rDNA-chrY-discarded.gff`
- `CHM13_canonical_chrY-only.gff`
- `CHM13_canonical_rDNA-only_with-roots.gff`
- `hybrid.gff`

In [None]:
%%bash
fastatools makecanonical -fi GRCh38.gff -mapfile ../fasta/GRCh38_mapfile.csv
fastatools makecanonical -fi CHM13.gff -mapfile ../fasta/CHM13_mapfile.csv
fastatools keepprimary   -fi GRCh38_canonical.gff -list ../fasta/GRCh38_primary_scaffolds.txt
fastatools gffdiscard -fi GRCh38_canonical_primary.gff -VDJ -rDNA -chrY
fastatools gffextract -fi CHM13_canonical.gff -chrY
fastatools gffextract -fi CHM13_canonical.gff -rDNA
fastatools rDNAaddparent -fi CHM13_canonical_rDNA-only.gff -fo CHM13_canonical_rDNA-only_with-roots.gff
cat GRCh38_canonical_primary_VDJ-rDNA-chrY-discarded.gff CHM13_canonical_chrY-only.gff > hybrid.gff
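
Because `hybrid.gff` is a plain concatenation, its per-sequence record counts should match those of the two input files. The following is a minimal sketch for inspecting that, assuming standard tab-separated nine-column GFF records; `gff_seqid_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: count feature records per sequence ID
# (column 1) in a GFF file. Lines starting with '#' and lines
# with fewer than nine tab-separated fields are skipped.
gff_seqid_counts() {
  awk -F'\t' '
    !/^#/ && NF >= 9 { n[$1]++ }
    END { for (s in n) print s "\t" n[s] }
  ' "$1" | sort
}
```

Running it on `hybrid.gff` and on each input makes it easy to see which sequences each file contributes, for example that the chrY records come only from `CHM13_canonical_chrY-only.gff`.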

### Run first pass of liftoff

In [None]:
%%bash
source ~/.bashrc
# First pass: lift only the rDNA annotation from CHM13 onto the VDJ-masked Ash1 assembly.
# -copies with -sc 0.95 also searches for extra copies at >= 95% sequence identity.
time liftoff \
  -p 40 -polish -exclude_partial -copies -sc 0.95 \
  -g Refseq/CHM13_canonical_rDNA-only_with-roots.gff -f Refseq/rRNA_types.txt -chroms Refseq/rRNA_chroms.csv \
  -dir /home/rhuang38/Ash1/run_liftoff/Refseq_1st_pass_intermediates \
  -o   /home/rhuang38/Ash1/run_liftoff/Refseq_1st_pass_intermediates/rRNA_only.gff \
  -u   /home/rhuang38/Ash1/run_liftoff/Refseq_1st_pass_intermediates/unmapped_rRNAs.txt \
  fasta/Ash1_canonical_VDJ-masked.fna fasta/CHM13_canonical.fna

# Mask the lifted rDNA loci; the second pass runs against this masked assembly.
bedtools maskfasta \
  -fi fasta/Ash1_canonical_VDJ-masked.fna \
  -bed ../run_liftoff/Refseq_1st_pass_intermediates/rRNA_only.gff_polished \
  -fo fasta/Ash1_canonical_VDJ-rDNA-masked.fna
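
One way to confirm that masking took effect is to compare the number of `N` bases before and after. This is a sketch only: it assumes the default hard-masking behaviour of `bedtools maskfasta` (bases replaced with `N`), and `count_masked_bases` is a hypothetical helper name. The count also includes any `N` gap bases already present in the assembly.

```shell
# Hypothetical helper: count N/n characters in the sequence
# lines of a FASTA file. Header lines (starting with '>') are
# excluded so that sequence names do not affect the count.
count_masked_bases() {
  grep -v '^>' "$1" | tr -cd 'Nn' | wc -c | tr -d ' '
}
```

The count for `fasta/Ash1_canonical_VDJ-rDNA-masked.fna` should be at least as large as the count for `fasta/Ash1_canonical_VDJ-masked.fna`; the difference can be smaller than the summed interval lengths if intervals overlap each other or existing `N` runs.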

### Run second pass of liftoff

In [None]:
%%bash
# Second pass: lift the hybrid annotation (GRCh38 primary plus CHM13 chrY) onto the VDJ- and rDNA-masked Ash1 assembly.
time liftoff \
  -p 40 -polish -exclude_partial -copies -sc 0.95 \
  -g Refseq/hybrid.gff -f Refseq/types.txt -chroms Refseq/canonical_chroms.txt \
  -dir /home/rhuang38/Ash1/run_liftoff/Refseq_2nd_pass_intermediates \
  -o /home/rhuang38/Ash1/run_liftoff/Refseq_2nd_pass_intermediates/second_pass.gff \
  -u /home/rhuang38/Ash1/run_liftoff/Refseq_2nd_pass_intermediates/unmapped_features.txt \
  fasta/Ash1_canonical_VDJ-rDNA-masked.fna fasta/hybrid.fna

# Same run with -prot_prior added; outputs go to a separate intermediates directory.
time liftoff \
  -p 40 -polish -exclude_partial -copies -sc 0.95 \
  -g Refseq/hybrid.gff -f Refseq/types.txt -chroms Refseq/canonical_chroms.txt \
  -dir /home/rhuang38/Ash1/run_liftoff/Refseq_2nd_pass_prot-prior_intermediates \
  -o /home/rhuang38/Ash1/run_liftoff/Refseq_2nd_pass_prot-prior_intermediates/second_pass.gff \
  -u /home/rhuang38/Ash1/run_liftoff/Refseq_2nd_pass_prot-prior_intermediates/unmapped_features.txt \
  fasta/Ash1_canonical_VDJ-rDNA-masked.fna fasta/hybrid.fna -prot_prior
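
The two second-pass runs differ only in `-prot_prior`, so comparing their unmapped-feature lists shows what the option changes. A minimal sketch, assuming each `unmapped_features.txt` holds one feature ID per line; `only_in_first` is a hypothetical helper name.

```shell
# Hypothetical helper: print lines that occur in the first file
# but not in the second. Both inputs are sorted and deduplicated
# into temporary files because comm requires sorted input.
only_in_first() (
  a=$(mktemp)
  b=$(mktemp)
  sort -u "$1" > "$a"
  sort -u "$2" > "$b"
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
)
```

Calling it with the plain run's list first and the `-prot_prior` run's list second prints features that were unmapped only in the plain run; swapping the arguments gives the reverse.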

### Merge results from two passes

In [None]:
%%bash
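# Sort the first-pass rRNA annotation alphabetically by sequence name, keeping gene records and all attributes.
gffread -F --keep-genes --sort-alpha rRNA_only.gff_polished > rRNA_only.sorted.gff
# Append the polished second-pass annotation.
cat rRNA_only.sorted.gff second_pass.gff_polished > Ash1.gff
# Re-sort by the sequence order listed in fasta/reflst.txt to produce the final annotation.
gffread -F --keep-genes --sort-by fasta/reflst.txt Ash1.gff > result/Ash1_refseq.gff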
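
Since the final sort depends on the names in `fasta/reflst.txt`, it can be useful to check that every sequence in the merged GFF appears in that list. The sketch below assumes the list file has one sequence name per line and is not empty; `seqids_not_in_list` is a hypothetical helper name.

```shell
# Hypothetical helper: print each sequence ID (column 1) that
# occurs in the GFF ($1) but is absent from the list file ($2).
# Each missing ID is printed once, in order of first appearance.
seqids_not_in_list() {
  awk -F'\t' '
    NR == FNR { ok[$1] = 1; next }
    !/^#/ && NF >= 9 && !($1 in ok) && !seen[$1]++ { print $1 }
  ' "$2" "$1"
}
```

An empty result from `seqids_not_in_list Ash1.gff fasta/reflst.txt` means every annotated sequence is covered by the sort list.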