# Accounting human miRNA overhangs

Input file `hsa.gff3` obtained from miRBase: 
<https://www.mirbase.org/ftp/CURRENT/genomes/hsa.gff3>

## Creating individual files 

First of all, different files for each feature - pre-miRs and mature miRNAs - and split the last one into 3 different files:
- `mirna_5p.gff3` for the 5p mature miRNAs. 
- `mirna_3p.gff3` for the 3p mature miRNAs.  
- `mirna_single.gff3` for the pre-miRs with a single mature miRNA.

In addition, names in column 9 will be formatted to contain only the name - no ID, no alias.

In [26]:
## Reading file
with open("hsa.gff3", "r") as f:
    records = [line.strip().split() for line in f.readlines() if line[0] != "#"]

## Formating last col to contain only the name
for rec in records:
    rec[-1] = rec[-1].split("=")[3].split(";")[0]

## Nesting lists to create files
pre_mirna = [rec for rec in records if rec[2] == "miRNA_primary_transcript"]
mirna_5p = [rec for rec in records if rec[2] == "miRNA" and rec[-1].split("-")[-1] == "5p"]
mirna_3p = [rec for rec in records if rec[2] == "miRNA" and rec[-1].split("-")[-1] == "3p"]
mirna_single = [rec for rec in records if rec[2] == "miRNA" and rec[-1].split("-")[-1] != "5p" and rec[-1].split("-")[-1] != "3p"]

## Removing transcript number from the name     
for record in pre_mirna:
    name = record[-1].split("-")
    record[-1] = "-".join(name[:3])

print(f" There are:\n- {len(pre_mirna)} miRNA primary transcripts")
print(f"- {len(mirna_5p)} mature miRNAs annotated as 5p")
print(f"- {len(mirna_3p)} mature miRNA annotated as 3p")
print(f"- {len(mirna_single)} mature miRNA annotated to be the only transcript of the pre-miRNA")



 There are:
- 1918 miRNA primary transcripts
- 992 mature miRNAs annotated as 5p
- 992 mature miRNA annotated as 3p
- 899 mature miRNA annotated to be the only transcript of the pre-miRNA


In the [Accounting human miRNA annotation file](https://github.com/zavolanlab/mirflowz/blob/mirna_accounting/explorations/accounting_mirna_annotations.ipynb) there are the same amount of mature miRNA records, 2883, but the distribution among `5p`, `3p` and `single` are different.
There are 27 more entries in both the `5p` and the `3p` therefore 54 less entries counted as single mature miRNAs.

This difference takes place due to some single-mature miRNA being annotated as `5p` or `3p`. 
If subtracting the extra sequences and adding them where they belong, the they sum up to the number of pre-mirs, 1918 

### Solution

Given this situation, the filtering must be done in another way. We know that if a pre-miR has two mature sequences, the features in the `gff3` file will be ordered as:

```
miRNA_primary_transcript_1
miRNA_1
miRNA_1
miRNA_primary_transcript_2
```

On the other hand, if the pre-miRNA has a single mature form, the `gff3` will look:

```
miRNA_primary_transcript_1
miRNA_1
miRNA_primary_trasncript_2
```
This configuration helps creating the files:

In [33]:
## Creating empty lists to store entries
pre_mirna = []
mirna_5p = []
mirna_3p = []
mirna_single = []

## Setting values for previous and next entries
prev_rec = None
pos_rec = None

## For each record except the last two...
for idx in range(len(records) - 2):

    ## ..if first entry..
    if prev_rec == None:
        prev_rec = records[idx][2]
        pos_rec = records[idx + 2][2]
        pre_mirna.append(records[idx])

    ## ..otherwise..
    else:
        if records[idx][2] == "miRNA_primary_transcript":
            pre_mirna.append(records[idx])
        else:
            if prev_rec == pos_rec:
                mirna_single.append(records[idx])
            else:
                if records[idx][-1].split("-")[-1] == "5p":
                    mirna_5p.append(records[idx])
                else:
                    mirna_3p.append(records[idx])
        prev_rec = records[idx][2]
        pos_rec = records[idx + 2][2]

## Reading last two entries
if records[-2][2] == "miRNA_primary_transcript":
    pre_mirna.append(records[-2])
    mirna_single.append(records[-1])
else:
    if records[-2][-1].split("-")[-1] == "5p":
        mirna_5p.append(records[-2])
        mirna_3p.append(records[-1])
    else:
        mirna_3p.append(records[-2])
        mirna_5p.append(records[-1])


print(f" There are:\n- {len(pre_mirna)} miRNA primary transcripts")
print(f"- {len(mirna_5p)} mature miRNAs annotated as 5p")
print(f"- {len(mirna_3p)} mature miRNA annotated as 3p")
print(f"- {len(mirna_single)} mature miRNA annotated to be the only transcript of the pre-miRNA")

 There are:
- 1918 miRNA primary transcripts
- 965 mature miRNAs annotated as 5p
- 965 mature miRNA annotated as 3p
- 953 mature miRNA annotated to be the only transcript of the pre-miRNA


Now that we got the correct classification counts, let's count the nucleotides' overhang per sequence in the form of dictionaries.  

In [36]:
import pandas as pd

#### REFORMATTING ENTRIES TO EASE THE COUNT ####

end3_clean = []
end5_clean = []
single_clean = []
pre_clean = []

for entry in mirna_3p:
    name = entry[-1]
    name = name.split("-")[:3]
    name = "-".join(name)
    if entry[6] == '-':
        end3_clean.append([name.lower(), entry[3], entry[0], entry[6]])
    else:
        end3_clean.append([name.lower(), entry[4], entry[0], entry[6]])

for entry in mirna_5p:
    name = entry[-1]
    name = name.split("-")[:3]
    name = "-".join(name)
    if entry[6] == '-':
        end5_clean.append([name.lower(), entry[4], entry[0], entry[6]])
    else:
        end5_clean.append([name.lower(), entry[3], entry[0], entry[6]])

for entry in mirna_single:
    name = entry[-1]
    name = name.split("-")[:3]
    name = "-".join(name)
    single_clean.append([name.lower(), entry[3], entry[4], entry[0], entry[6]])

for entry in pre_mirna:
    name = entry[-1]
    name = name.split("-")[:3]
    name = "-".join(name)
    pre_clean.append([name, entry[3], entry[4], entry[0], entry[6]])


#### MAIN PART: CREATING TABLES ####

count_5 = {}
count_3 = {}
total_5 = 0
total_3 = 0


for mir in pre_clean:
    for miR3 in end3_clean:
        if mir[0] == miR3[0] and mir[1] <= miR3[1] <= mir[2] and mir[3] == miR3[2]:
            total_3 += 1
            ntl = str(int(mir[2]) - int(miR3[1]))
            if int(ntl) > 20:
            	ntl = "20+"
            if ntl in count_3.keys():
                count_3[ntl] += 1
            else:
                count_3[ntl] = 1
            idx = end3_clean.index(miR3)
            del end3_clean[idx]
            break

    for miR5 in end5_clean:
        if mir[0] == miR5[0] and mir[1] <= miR5[1] <= mir[2] and mir[3] == miR5[2]:
            total_5 += 1
            ntl = str(int(miR5[1]) - int(mir[1]))
            if int(ntl) > 20:
            	ntl = "20+"
            if ntl in count_5.keys():
                count_5[ntl] += 1
            else:
                count_5[ntl] = 1
            idx = end5_clean.index(miR5)
            del end5_clean[idx]
            break
    
    for miR in single_clean:
        if mir[0] == miR[0] and mir[1] <= miR[1] and mir[2] >= miR[2] and mir[3] == miR[3]:
            total_5 += 1
            total_3 += 1
            ntl_3 = str(int(mir[2]) - int(miR[2]))
            ntl_5 = str(int(miR[1]) - int(mir[1]))
            if int(ntl_3) > 20:
            	ntl_3 = "20+"
            if int(ntl_5) > 20:
            	ntl_5 = "20+"
            if ntl_5 in count_5.keys():
                count_5[ntl_5] += 1
            else:
                count_5[ntl_5] = 1
            if ntl_3 in count_3.keys():
                count_3[ntl_3] += 1
            else:
                count_3[ntl_3] = 1
            idx = single_clean.index(miR)
            del single_clean[idx]
            break


#### WRITTING DATARFAME TO MANIPULATE ON R ####

final3 = dict(sorted(count_3.items()))
final5 = dict(sorted(count_5.items()))

df3 = pd.DataFrame(final3, index=[0])
df3.to_csv("count_3.csv", index = False, sep = "\t")

df5 = pd.DataFrame(final5, index=[0])
df5.to_csv("count_5.csv", index = False, sep = "\t")

## Results

What follows is the script used in R to create the tables:

```R
library(tidyr)
library(dplyr)

## Creating table for the 3p overhangs
count_3 <- read.csv("count_3.csv", header = F, sep = "\t")
count_3 <- as.data.frame(t(count_3))
row.names(count_3) <- NULL
colnames(count_3) <- c("Overhang", "Count")

count_3 <- count_3 %>% 
                mutate("Fraction" = round(as.numeric(Count) /1918, 4)) %>%
                arrange(as.numeric(Overhang))
count_3$Overhang <- as.numeric(count_3$Overhang)
cum3 <- cumsum(count_3$Fraction)
count_3 <- count_3 %>%
  mutate("Comulative Proportion" = cum3)


## Creating table for the 5p overhangs
count_5 <- read.csv("count_5.csv", header = F, sep = "\t")
count_5 <- as.data.frame(t(count_5))
row.names(count_5) <- NULL
colnames(count_5) <- c("Overhang", "Count")

count_5 <- count_5 %>% 
                mutate("Fraction" = round(as.numeric(Count) /1918, 4)) %>%
                arrange(as.numeric(Overhang))
count_5$Overhang <- as.numeric(count_5$Overhang)
cum5 <- cumsum(count_5$Fraction)
count_5 <- count_5 %>%
  mutate("Comulative Proportion" = cum5)


# Merging tables

both <- inner_join(count_5, count_3,by = "Overhang", suffix = c("_5p", "_3p")) %>%
        arrange(as.numeric(Overhang))
  
both$Overhang[22] <- "20+"



write.table(both, "overhang.txt", quote = F, sep = "\t", row.names = F)
```

The final table looks like:

|Overhang (nts) |Count_5p | Proportion_5p| Cumulative_5p|Count_3p | Proportion_3p| Cumulative_3p| 
|:--------------|:--------|-------------:|-------------:|:--------|-------------:|-------------:| 
|0              |83       |        0.0433|        0.0433|139      |        0.0725|        0.0725| 
|1              |19       |        0.0099|        0.0532|28       |        0.0146|        0.0871| 
|2              |27       |        0.0141|        0.0673|37       |        0.0193|        0.1064| 
|3              |25       |        0.0130|        0.0803|34       |        0.0177|        0.1241| 
|4              |33       |        0.0172|        0.0975|37       |        0.0193|        0.1434| 
|5              |102      |        0.0532|        0.1507|24       |        0.0125|        0.1559| 
|6              |21       |        0.0109|        0.1616|32       |        0.0167|        0.1726| 
|7              |31       |        0.0162|        0.1778|35       |        0.0182|        0.1908| 
|8              |36       |        0.0188|        0.1966|39       |        0.0203|        0.2111| 
|9              |95       |        0.0495|        0.2461|63       |        0.0328|        0.2439| 
|10             |111      |        0.0579|        0.3040|133      |        0.0693|        0.3132| 
|11             |26       |        0.0136|        0.3176|51       |        0.0266|        0.3398| 
|12             |40       |        0.0209|        0.3385|53       |        0.0276|        0.3674| 
|13             |35       |        0.0182|        0.3567|33       |        0.0172|        0.3846| 
|14             |49       |        0.0255|        0.3822|33       |        0.0172|        0.4018| 
|15             |88       |        0.0459|        0.4281|63       |        0.0328|        0.4346| 
|16             |14       |        0.0073|        0.4354|7        |        0.0036|        0.4382| 
|17             |8        |        0.0042|        0.4396|17       |        0.0089|        0.4471| 
|18             |7        |        0.0036|        0.4432|10       |        0.0052|        0.4523| 
|19             |12       |        0.0063|        0.4495|14       |        0.0073|        0.4596| 
|20             |17       |        0.0089|        0.4584|15       |        0.0078|        0.4674| 
|20+            |1036     |        0.5401|        0.9985|1021     |        0.5323|        0.9997|