# Accounting for the human miRNA annotation file

The input file `hsa.gff3` was obtained from miRBase: https://www.mirbase.org/ftp/CURRENT/genomes/hsa.gff3

## Accounting for all records in file

First, we check that the file consists only of comment lines, pre-miR records and mature miRNA records:

In [3]:
comments = 0
premirs = []
mature = []
other = []
with open("hsa.gff3") as _file:
    for record in _file:
        if record.startswith("#"):
            comments += 1
            continue
        record = record.split()
        if record[2] == "miRNA_primary_transcript":
            premirs.append(record[8])
        elif record[2] == "miRNA":
            mature.append(record[8])
        else:
            other.append(record)

with open("hsa.gff3") as _file:
    n_lines = len(_file.readlines())

print(f"{n_lines=}\n")
print(f"{comments=}")
print(f"{len(premirs)=}")
print(f"{len(mature)=}")
print(f"{len(other)=}")
print(
    "\nTotal number of lines in file equals to number "
    "of comments, pre-miRNA and mature miRNA records: "
    f"{n_lines == comments + len(premirs) + len(mature)}"
)

n_lines=4814

comments=13
len(premirs)=1918
len(mature)=2883
len(other)=0

Total number of lines in file equals to number of comments, pre-miRNA and mature miRNA records: True
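
The same tally can also be made in a single pass with `collections.Counter`. The following is only a minimal sketch: the helper name `count_feature_types` and the three-line excerpt are hypothetical (made-up IDs and coordinates), and it assumes tab-separated GFF3 records with no blank lines.

```python
from collections import Counter
import io


def count_feature_types(lines):
    """Tally comment lines and GFF3 feature types (column 3) in one pass."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            counts["comment"] += 1
            continue
        # GFF3 columns are tab-separated; the feature type is the third column
        fields = line.rstrip("\n").split("\t")
        counts[fields[2]] += 1
    return counts


# Hypothetical excerpt in GFF3 layout, for illustration only
sample = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tmiRNA_primary_transcript\t100\t180\t.\t-\t.\t"
    "ID=MI0000001;Name=hsa-mir-example\n"
    "chr1\t.\tmiRNA\t150\t171\t.\t-\t.\t"
    "ID=MIMAT0000001;Name=hsa-miR-example-5p;Derives_from=MI0000001\n"
)
counts = count_feature_types(sample)
print(counts)
```

For the real file, the open file handle can be passed in place of `sample`; any feature type other than `miRNA_primary_transcript` and `miRNA` would then show up as an extra key.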


## How many mature miRNAs are associated with the pre-miRNA transcripts?

So no surprises here. Next, let's count the mature miRNAs associated with each pre-miR. We store the result in a dictionary `premirs`, with the attribute field of each pre-miR record as key and the number of associated mature miRNAs as value. Note that this approach relies on every mature miRNA record following its pre-miR record in the file.

In [4]:
last_premir = None
premirs = {}
matures = []
with open("hsa.gff3") as _file:
    for record in _file:
        if record.startswith("#"):
            continue
        record = record.split()
        if record[2] == "miRNA_primary_transcript":
            # conclude processing of last pre-miR record
            if last_premir is not None:
                premirs[last_premir] = len(matures)
                matures = []
            last_premir = record[8]
        else:
            matures.append(record[8])
# account for very last record
premirs[last_premir] = len(matures)
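
The loop above depends on record order. As a possible cross-check, the association could instead be read from the attribute column. The sketch below assumes that mature records carry a `Derives_from` attribute holding the `ID` of their pre-miR, which should be verified against the actual file. The helper names and the sample records are hypothetical.

```python
import io


def parse_attributes(field):
    """Turn a 'key=value;key=value' attribute string into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if item)


def matures_by_parent(lines):
    """Map each pre-miR ID to the IDs of the mature miRNAs that reference it."""
    children = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        attrs = parse_attributes(fields[8])
        if fields[2] == "miRNA_primary_transcript":
            # register the pre-miR even if no mature record refers to it
            children.setdefault(attrs["ID"], [])
        elif fields[2] == "miRNA":
            # assumption: Derives_from names the parent pre-miR
            children.setdefault(attrs["Derives_from"], []).append(attrs["ID"])
    return children


# Hypothetical records (made-up IDs and coordinates), for illustration only
sample = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tmiRNA_primary_transcript\t100\t180\t.\t+\t.\tID=MI0000001;Name=pre-a\n"
    "chr1\t.\tmiRNA\t105\t126\t.\t+\t.\tID=MIMAT0000001;Name=a-5p;Derives_from=MI0000001\n"
    "chr1\t.\tmiRNA\t150\t171\t.\t+\t.\tID=MIMAT0000002;Name=a-3p;Derives_from=MI0000001\n"
    "chr2\t.\tmiRNA_primary_transcript\t500\t570\t.\t-\t.\tID=MI0000002;Name=pre-b\n"
    "chr2\t.\tmiRNA\t510\t531\t.\t-\t.\tID=MIMAT0000003;Name=b-5p;Derives_from=MI0000002\n"
)
by_parent = matures_by_parent(sample)
n_per_premir = {premir: len(ids) for premir, ids in by_parent.items()}
print(n_per_premir)
```

If the attribute is present as assumed, the per-pre-miR counts from this approach should agree with those from the order-based loop.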

Now let's check if all pre-miRs and mature miRs are accounted for:

In [5]:
print(f"Number of pre-miR records: {len(premirs)}")
print(f"Number of mature miRNA records: {sum(premirs.values())}")

Number of pre-miR records: 1918
Number of mature miRNA records: 2883


Good. Now let's check how many pre-miRs have how many mature miRNAs associated with them:

In [6]:
from collections import defaultdict

# map "number of mature miRNAs" -> "number of pre-miRs with that many"
matures_per_premir = defaultdict(int)
for n_mature in premirs.values():
    matures_per_premir[n_mature] += 1

print(f"Out of the {sum(matures_per_premir.values())} pre-miRNA transcripts,")
for n_mature, count in matures_per_premir.items():
    print(f"- {count} have {n_mature} mature miRNAs")
# total mature miRNAs = sum over groups of (miRNAs per pre-miR * group size)
n_matures = sum(n_mature * count for n_mature, count in matures_per_premir.items())
print(f"associated with them, accounting for a total of {n_matures} mature miRNAs.")

Out of the 1918 pre-miRNA transcripts,
- 965 have 2 mature miRNAs
- 953 have 1 mature miRNAs
associated with them, accounting for a total of 2883 mature miRNAs.
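
The histogram step can also be sketched with `collections.Counter`. The dictionary `premirs_example` below is a hypothetical stand-in for `premirs`. The final assertions only re-check the arithmetic of the numbers reported above.

```python
from collections import Counter

# Hypothetical stand-in for the `premirs` dictionary (pre-miR -> number of matures)
premirs_example = {"pre-a": 2, "pre-b": 1, "pre-c": 2}

# number of mature miRNAs -> number of pre-miRs with that many
matures_per_premir = Counter(premirs_example.values())
n_premirs = sum(matures_per_premir.values())
n_matures = sum(n * count for n, count in matures_per_premir.items())
print(matures_per_premir, n_premirs, n_matures)

# Consistency check on the reported figures
assert 965 + 953 == 1918
assert 965 * 2 + 953 * 1 == 2883
```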


## Conclusion

We have established that `hsa.gff3` contains 4814 lines:
- 13 comment lines
- 1918 pre-miR transcript records
- 2883 mature miRNA records

Out of the 1918 pre-miRNA records,
- 953 have 1 mature miRNA
- 965 have 2 mature miRNAs

associated with them, accounting for all 2883 mature miRNAs.

This means that **all of the 1918 pre-miRNA transcripts have exactly one or two mature miRNAs associated with them**!