# Exploring PMN file formats
Specs are listed [here](http://bioinformatics.ai.sri.com/ptools/flatfile-format.html#reactions.dat). I'm playing around specifically with files from `aracyc/17.0/data/`.

In [1]:
import pandas as pd
import numpy as np
import json
from os import listdir
from os.path import join

In [2]:
pathbase = '../data/straying_off_topic_data/pmn_data/pmn/plantcyc/aracyc/17.0/data/'

My current thought is that, if the fields are consistent across the entire document (they may not be; there have been no promises), those fields can become column names, and each `//`-delimited section can be a row. I'll eventually have to account for comments too, but I'm intentionally using a file without comments for now, just to see if what I think is possible works.

Update: They are definitely not consistent across the entire document (e.g. fields are just missing entirely if they are empty in a given entry). Additionally, in some files the full list of fields is specified in the opening comment, but in others it is not. This is irritating af because it means that I'll have to manually supply the possible field names from the website, but whatever. Also, everything has comments, so I'm just going to have to deal with them now.

My actual question is, how on earth do they build the website from this stuff????

Also -- some files seem to have both a `.col` and a `.dat` version, do they actually contain the same information? Going to read in both for this play file and see.

#### Reading in .dat file

Ah yes, some files do have the requisite ` - `.... and there are also `-`'s in the field names. *cries internally*

The good news is that the attr-value separating hyphens are surrounded by spaces, so I will look for the first occurrence of ` - ` in order to separate fields from values.
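A minimal sketch of the "first occurrence" idea, using `str.partition` so that any later ` - ` stays inside the value. The helper name `split_attr_value` and the example line are made up for illustration; they are not from the PMN files.

```python
def split_attr_value(line):
    """Split an attribute-value line on the first ' - ' only.

    Returns (field, value), or None if the line has no ' - ' separator.
    """
    field, sep, value = line.partition(' - ')
    if not sep:
        return None
    return field, value


# Hypothetical line: the field name contains a bare hyphen and the
# value contains a second ' - ' that must not be treated as a separator.
example = 'COMMON-NAME - alpha - beta hydrolase'
print(split_attr_value(example))
```

Because `partition` stops at the first match, hyphens inside field names like `COMMON-NAME` are left alone (they have no surrounding spaces), and anything after the first ` - ` is returned whole.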

Fields can occur multiple times with different values... great.

The docs say that for long strings the field entries can be on multiple lines... I hope to god that means they wrap depending on the size of the screen, and not that they're actually split with newlines. I'm going to assume the former until proven guilty, because idk what kind of masochist would do the latter.
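In case the values really are split with newlines, here is a rough sketch of how one entry could be parsed while keeping repeated fields and stitching continuation lines back together. It rests on one assumption I have not verified against a real file: that a continuation line starts with `/`, which is how I read the spec. `parse_entry` and the toy entry are hypothetical.

```python
from collections import defaultdict


def parse_entry(entry_text):
    """Parse one entry (already split on '//') into {field: [values]}.

    ASSUMPTION: a line starting with '/' continues the previous value.
    """
    record = defaultdict(list)
    last_field = None
    for line in entry_text.splitlines():
        # Skip blanks, comments, and any stray entry terminator
        if not line or line.startswith('#') or line.strip() == '//':
            continue
        if line.startswith('/') and last_field is not None:
            # Glue the continuation text onto the most recent value
            record[last_field][-1] += '\n' + line[1:]
            continue
        field, sep, value = line.partition(' - ')
        if sep:
            # Repeated fields simply accumulate in the list
            record[field].append(value)
            last_field = field
    return dict(record)


# Toy entry with a repeated field and a two-line comment (made-up values)
toy_entry = (
    'UNIQUE-ID - GENE-1\n'
    'SYNONYMS - abc1\n'
    'SYNONYMS - xyz2\n'
    'COMMENT - first line of a long comment\n'
    '/second line of the same comment\n'
)
parsed = parse_entry(toy_entry)
print(parsed)
```

Storing every field as a list avoids having to decide between a single value and a tuple later on; a field that appears once is just a one-element list.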

In [3]:
play_file = join(pathbase, 'genes.dat')
play_file_fields = [
    'UNIQUE-ID',
    'TYPES',
    'COMMON-NAME',
    'CENTISOME-POSITION',
    'CITATIONS',
    'COMMENT',
    'COMPONENT-OF',
    'COMPONENTS',
    'DBLINKS',
    'IN-PARALOGOUS-GENE-GROUP',
    'INTERRUPTED?',
    'LAST-UPDATE',
    'LEFT-END-POSITION',
    'PRODUCT',
    'PRODUCT-STRING',
    'RIGHT-END-POSITION',
    'SYNONYMS',
    'TRANSCRIPTION-DIRECTION'
]

In [29]:
def read_dat(path, fields):
    """
    Given a file path and a list of possible field names, create a df from a PMN .dat file.
    
    Edge cases accounted for:
     - Comments
     - Fieldnames being used multiple times with different values in the same entry
     - Hyphens not preceded or followed by spaces being present in fieldnames or values
     - Empty lines
     - // being present in URLs (separates only on // followed by a newline)
    
    parameters:
        path, str: path to file to read
        fields, list of str: list of possible field names in file to read
        
    returns: 
        df, pandas df: dataframe with field names as columns, with NA's in cells where the field was missing
            for that row
    """
    # Read in file
    with open(path, encoding='windows-1250') as myf:
        lines = myf.read()
        
    # Separate file on // first
    lines = lines.split('//\n')
    
    # Make a list of dicts with each dict being an entry
    df_dicts = []
    for entry in lines:
        # Split on newlines to get individual entries
        attr_vals = entry.split('\n')
        # Drop any empty lines
        attr_vals = [av for av in attr_vals if av]
        # Drop any lines beginning with #
        attr_vals = [av for av in attr_vals if not av.startswith('#')]
        # Split on the first ' - ' only, so values that themselves contain ' - ' stay intact
        attr_vals = [av.split(' - ', 1) for av in attr_vals]
        
        entry_dict = {}
        for colname in fields:
            # Get all attr-val pairs with the fieldname
            field_attr_vals = [av for av in attr_vals if av[0]==colname]
            # If there's more than one entry for this fieldname, make the vals into a tuple
            if len(field_attr_vals) > 1:
                val = tuple([fav[1] for fav in field_attr_vals])
            # If there aren't any entries for this fieldname, add an NA
            elif len(field_attr_vals) == 0:
                val = np.nan
            # If there's only one value, add that as a single val
            else: val = field_attr_vals[0][1]
            # Add entry to dict
            entry_dict[colname] = val
            
        df_dicts.append(entry_dict)

    # Make the df
    df = pd.DataFrame(df_dicts)
    
    return df

In [5]:
play_df = read_dat(play_file, play_file_fields)
play_df.head()

Unnamed: 0,UNIQUE-ID,TYPES,COMMON-NAME,CENTISOME-POSITION,CITATIONS,COMMENT,COMPONENT-OF,COMPONENTS,DBLINKS,IN-PARALOGOUS-GENE-GROUP,INTERRUPTED?,LAST-UPDATE,LEFT-END-POSITION,PRODUCT,PRODUCT-STRING,RIGHT-END-POSITION,SYNONYMS,TRANSCRIPTION-DIRECTION
0,AT3G25960,Unclassified-Genes,AT3G25960,40.474438,,,CHROMOSOME-3-4,,"(TAIR ""AT3G25960"" NIL |pzhang| 3739906193 NIL ...",,,,9499676.0,AT3G25960-MONOMER,,9501169.0,,
1,AT4G28130,Unclassified-Genes,AT4G28130,75.17636,,,CHROMOSOME-4-7,,"(TAIR ""AT4G28130"" NIL |pzhang| 3739906195 NIL ...",,,,13971558.0,AT4G28130-MONOMER,,13974329.0,DGK6,
2,AT5G10650,ORFs,AT5G10650,,,,,,,,,,,"(MONOMERQT-7998, MONOMERQT-7997, AT5G10650-MON...",,,,
3,AT1G25055,ORFs,AT1G25055,,,,,,,,,,,AT1G25055-MONOMER,,,,
4,AT3G28480,Unclassified-Genes,AT3G28480,45.49173,,,CHROMOSOME-3-5,,"(TAIR ""AT3G28480"" NIL |pzhang| 3739906193 NIL ...",,,,10677270.0,AT3G28480-MONOMER,,10679625.0,,


#### Read in the corresponding .col file

There's a bigass comment section at the top of this too, ffs

In [6]:
play_col_path = join(pathbase, 'genes.col')
play_col = pd.read_csv(play_col_path, sep='\t', comment='#')
play_col.head()

Unnamed: 0,UNIQUE-ID,NAME,PRODUCT-NAME,SWISS-PROT-ID,REPLICON,START-BASE,END-BASE,SYNONYMS,SYNONYMS.1,SYNONYMS.2,SYNONYMS.3,GENE-CLASS,GENE-CLASS.1,GENE-CLASS.2,GENE-CLASS.3
0,AT3G25960,AT3G25960,pyruvate kinase,Q9LU95,CHROMOSOME-3,9499676.0,9501169.0,,,,,UNCLASSIFIED,,,
1,AT4G28130,AT4G28130,diacylglycerol kinase,,CHROMOSOME-4,13971558.0,13974329.0,DGK6,,,,UNCLASSIFIED,,,
2,AT5G10650,AT5G10650,AT5G10650,,,,,,,,,,,,
3,AT1G25055,AT1G25055,AT1G25055,,,,,,,,,,,,
4,AT3G28480,AT3G28480,"oxidoreductase, acting on paired donors, with ...",Q9LSI6,CHROMOSOME-3,10677270.0,10679625.0,,,,,UNCLASSIFIED,,,
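Judging by the `.1`, `.2`, `.3` suffixes pandas added, the `.col` header repeats names like `SYNONYMS` and `GENE-CLASS`. A sketch of an alternative reader that folds repeated headers into one list per row, using only the `csv` module. `read_col` and the toy file contents are hypothetical, and I am assuming comment lines always start with `#`.

```python
import csv
import io


def read_col(handle):
    """Read a tab-delimited .col file, folding repeated headers into lists."""
    # Drop comment lines before handing the rest to the csv reader
    rows = (line for line in handle if not line.startswith('#'))
    reader = csv.reader(rows, delimiter='\t')
    header = next(reader)
    records = []
    for values in reader:
        if not values:
            continue  # blank line
        record = {}
        for name, value in zip(header, values):
            record.setdefault(name, [])
            if value:
                record[name].append(value)
        records.append(record)
    return records


# Toy stand-in for a .col file (made-up IDs and synonyms)
toy_col = io.StringIO(
    '# comment block at the top\n'
    'UNIQUE-ID\tNAME\tSYNONYMS\tSYNONYMS\n'
    'GENE-1\tabc1\tfoo\t\n'
    'GENE-2\txyz2\tfoo\tbar\n'
)
col_records = read_col(toy_col)
print(col_records)
```

This makes the `.col` rows easier to compare with the `.dat` entries, where a repeated field also ends up as a collection of values.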


In [7]:
# Check if the unique-name cols have the same vals
dat_genes = set(play_df['UNIQUE-ID'])
col_genes = set(play_col['UNIQUE-ID'])
dat_genes == col_genes

False

In [8]:
overlap = dat_genes.intersection(col_genes)
print(len(dat_genes), len(col_genes), len(overlap))

6346 6345 6345


It looks like these files encode totally different information about more or less the same genes... There is one gene missing from the col file that's present in the dat file. WHY ARE THERE TWO DIFFERENT FILES FOR THIS, I'M SCREAMING
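To see which ID actually accounts for the difference, a set difference is enough. One thing worth ruling out, which I have not checked here: `read_dat` appends a row for every chunk, so if the file ends with `//` the empty chunk after it becomes a row whose `UNIQUE-ID` is NaN, and NaN counts as a set element. The sets below are toy stand-ins with made-up IDs.

```python
import math

# Toy stand-ins for dat_genes and col_genes
dat_ids = {'GENE-1', 'GENE-2', float('nan')}
col_ids = {'GENE-1', 'GENE-2'}


def is_missing(value):
    """True for a float NaN, which is how a missing UNIQUE-ID would show up."""
    return isinstance(value, float) and math.isnan(value)


# Everything in the .dat set that the .col set lacks
only_in_dat = dat_ids - col_ids
# The same, but ignoring NaN placeholders
real_extras = {i for i in only_in_dat if not is_missing(i)}
print(len(only_in_dat), len(real_extras))
```

If `real_extras` comes back empty on the real data, the "extra gene" is just an empty row rather than a gene that the `.col` file lacks.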

## Deciding what data can be useful
Want to choose file types and categories that I'll actually utilize in my relation extraction pipeline

In [9]:
!ls ../data/straying_off_topic_data/pmn_data/pmn/plantcyc/aracyc/17.0/data/

biopax-level2.owl		genes.dat		 pubs.dat
biopax-level3.owl		metabolic-reactions.xml  reaction-links.dat
classes.dat			overview.graph		 reactions.dat
compound-links.dat		pathway-links.dat	 regulation.dat
compounds.dat			pathways.col		 regulons.dat
dnabindsites.dat		pathways.dat		 rnas.dat
enzrxns.dat			promoters.dat		 species.dat
enzymes.col			protcplxs.col		 terminators.dat
gene_association.aracyc		protein-features.dat	 transporters.col
gene_association.aracyc-errors	protein-links.dat	 transunits.dat
gene-links.dat			proteins.dat
genes.col			protligandcplxes.dat


In [10]:
!ls ../data/straying_off_topic_data/pmn_data/pmn/plantcyc/aracyc/17.0/input/

Araport11_genes.201606.pep.repr.fasta.pmn.e2p2v3.orxn.pf
at_chloroplast.pf
at_chrom1.pf
at_chrom2.pf
at_chrom3.pf
at_chrom4.pf
at_chrom5.pf
at_genetic_loci.pf
at_mito.pf
genetic-elements.dat
organism.dat
organism.dat~
organism-init.dat
sample-genetic-elements.dat


In [11]:
aracyc = pd.read_csv('../data/straying_off_topic_data/pmn_data/pmn/plantcyc/aracyc/17.0/data/gene_association.aracyc', sep='\t', header=None)
aracyc.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,AraCyc,AT3G09260-MONOMER,AT3G09260,,GO:0010168,PMID:12581307,IDA,,C,scopolin beta-glucosidase,AT3G09260|PYK10|PSR3.1|LEB|BGLU23|AT3G09260,protein,taxon:,20130130,AraCyc
1,AraCyc,AT3G09260-MONOMER,AT3G09260,,GO:0005783,PMID:22923678,IDA,,C,scopolin beta-glucosidase,AT3G09260|PYK10|PSR3.1|LEB|BGLU23|AT3G09260,protein,taxon:,20130130,AraCyc
2,AraCyc,AT5G48950-MONOMER,AT5G48950,,GO:0005777,PMID:22372525,IDA,,C,"1,4-dihydroxy-2-naphthoyl-CoA thioesterase 2",DHNAT2|AT5G48950,protein,taxon:,20120420,AraCyc
3,AraCyc,AT2G40690-MONOMER,AT2G40690,,GO:0031969,PMID:20061580,IDA,,C,glycerol 3-phosphate dehydrogenase,SFD1|GLY1|AT2G40690,protein,taxon:,20120402,AraCyc
4,AraCyc,AT2G30140-MONOMER,AT2G30140,,GO:0005737,PMID:22404750,IDA,,C,UDP-glycosyltransferase 87A2,UGT87A2|AT2G30140,protein,taxon:,20120426,AraCyc


Thank you for having no information whatsoever about wtf the contents of this file are............
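For what it's worth, the 15 columns look like they line up with the GO Annotation File (GAF) 1.0 layout. That is my inference from the shape of the rows above, not something the file itself states, so treat the names below as an assumption. The sketch labels one row copied from the output above.

```python
import csv
import io

# ASSUMPTION: column order follows GAF 1.0 (15 columns)
GAF_COLUMNS = [
    'DB', 'DB_Object_ID', 'DB_Object_Symbol', 'Qualifier', 'GO_ID',
    'DB_Reference', 'Evidence_Code', 'With_or_From', 'Aspect',
    'DB_Object_Name', 'DB_Object_Synonym', 'DB_Object_Type',
    'Taxon', 'Date', 'Assigned_By',
]

# Row 3 from the head() output, re-joined with tabs
row = '\t'.join([
    'AraCyc', 'AT2G40690-MONOMER', 'AT2G40690', '', 'GO:0031969',
    'PMID:20061580', 'IDA', '', 'C', 'glycerol 3-phosphate dehydrogenase',
    'SFD1|GLY1|AT2G40690', 'protein', 'taxon:', '20120402', 'AraCyc',
])

reader = csv.DictReader(io.StringIO(row), fieldnames=GAF_COLUMNS, delimiter='\t')
record = next(reader)
print(record['DB_Object_Symbol'], record['GO_ID'], record['Evidence_Code'])
print(record['DB_Object_Synonym'].split('|'))
```

The same list could be passed as `names=GAF_COLUMNS` to the `pd.read_csv` call above, if the guess holds up.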

To save myself a little heartache in the future, I'm going to make a file that contains the field names for the various file types listed on the specs page. It'll be a JSON where each key is a file name and each value is the list of that file's field names. This will also help me see what kind of data each one has and whether or not I want to use it.

In [12]:
field_json = {
    'classes.dat':['UNIQUE-ID','TYPES','COMMENT','COMMON-NAME','SYNONYMS'],
    'bindrxns.dat':['UNIQUE-ID','TYPES','ACTIVATORS','INHIBITORS','OFFICIAL-EC','REACTANTS'],
    'compounds.dat':['UNIQUE-ID','TYPES','COMMON-NAME','ABBREV-NAME','ATOM-CHARGES','CHEMICAL-FORMULA',
                    'CITATIONS','COFACTORS-OF','COFACTORS-OR-PROSTHETIC-GROUPS-OF','COMMENT','COMPONENT-OF',
                    'CREDITS','DATA-SOURCE','DBLINKS','INCHI','MOLECULAR-WEIGHT','N+1-NAME','N-1-NAME','N-NAME',
                    'PKA1','PKA2','PKA3','PROSTHETIC-GROUPS-OF','REGULATES','SMILES','SUPERATOMS','SYNONYMS',
                    'SYSTEMATIC-NAME'],
    'dnabindsites.dat':['UNIQUE-ID','TYPES','ABS-CENTER-POS','CITATIONS','COMMENT','COMPONENT-OF','DBLINKS',
                        'INVOLVED-IN-REGULATION','REGULATED-PROMOTER','RELATIVE-CENTER-DISTANCE','SYNONYMS',
                        'TYPE-OF-EVIDENCE'],
    'enzrxns.dat':['UNIQUE-ID','TYPES','COMMON-NAME','ALTERNATIVE-COFACTORS','ALTERNATIVE-SUBSTRATES','CITATIONS',
                    'COFACTOR-BINDING-COMMENT','COFACTORS','COFACTORS-OR-PROSTHETIC-GROUPS','COMMENT','ENZYME','KM',
                    'PH-OPT','PROSTHETIC-GROUPS','REACTION','REACTION-DIRECTION','REGULATED-BY',
                    'REQUIRED-PROTEIN-COMPLEX','SYNONYMS','TEMPERATURE-OPT'],
    'genes.dat':['UNIQUE-ID','TYPES','COMMON-NAME','CENTISOME-POSITION','CITATIONS','COMMENT','COMPONENT-OF',
                'COMPONENTS','DBLINKS','IN-PARALOGOUS-GENE-GROUP','INTERRUPTED?','LAST-UPDATE','LEFT-END-POSITION',
                'PRODUCT','PRODUCT-STRING','RIGHT-END-POSITION','SYNONYMS','TRANSCRIPTION-DIRECTION'],
    'pathways.dat':['UNIQUE-ID','TYPES','COMMON-NAME','CITATIONS','CLASS-INSTANCE-LINKS','COMMENT','CREDITS',
                    'DBLINKS','ENZYME-USE','HYPOTHETICAL-REACTIONS','IN-PATHWAY','NET-REACTION-EQUATION',
                    'PATHWAY-INTERACTIONS','PATHWAY-LINKS','POLYMERIZATION-LINKS','PREDECESSORS','PRIMARIES',
                    'REACTION-LAYOUT','REACTION-LIST','SPECIES','SUB-PATHWAYS','SUPER-PATHWAYS','SYNONYMS'],
    'promoters.dat':['UNIQUE-ID','TYPES','COMMON-NAME','ABSOLUTE-PLUS-1-POS','BINDS-SIGMA-FACTOR','CITATIONS',
                    'COMMENT','COMPONENT-OF','DBLINKS','PROMOTER-EVIDENCE','REGULATED-BY','SYNONYMS'],
    'protein-features.dat':['UNIQUE-ID','TYPES','COMMON-NAME','COMMENT','ATTACHED-GROUP','CITATIONS','FEATURE-OF',
                            'HOMOLOGY-MOTIF','LEFT-END-POSITION','POSSIBLE-FEATURE-STATES','RESIDUE-NUMBER',
                            'RESIDUE-TYPE','RIGHT-END-POSITION'],
    'proteins.dat':['UNIQUE-ID','TYPES','COMMON-NAME','CATALYZES','CHEMICAL-FORMULA','CITATIONS','COMMENT',
                    'COMPONENT-OF','CREDITS','DBLINKS','DNA-FOOTPRINT-SIZE','FEATURES','GENE','GO-TERMS',
                    'ISOZYME-SEQUENCE-SIMILARITY','LOCATIONS','MODIFIED-FORM','MOLECULAR-WEIGHT-EXP',
                    'MOLECULAR-WEIGHT-KD','MOLECULAR-WEIGHT-SEQ','PI','REGULATES','SPECIES','SPLICE-FORM-INTRONS',
                    'SYMMETRY','SYNONYMS','UNMODIFIED-FORM'],
    'protligandcplxes.dat':['UNIQUE-ID','TYPES','CATALYZES','COMMON-NAME','COMMENT','COMPONENTS','DBLINKS',
                            'DNA-FOOTPRINT-SIZE','MOLECULAR-WEIGHT-KD','MOLECULAR-WEIGHT-SEQ','REGULATES','SYMMETRY',
                            'SYNONYMS'],
    'pubs.dat':['UNIQUE-ID','TYPES','ABSTRACT','AUTHORS','COMMENT','MEDLINE-UID','PUBMED-ID','REFERENT-FRAME',
                'SOURCE','TITLE','YEAR'],
    'reactions.dat':['UNIQUE-ID','TYPES','COMMON-NAME','CITATIONS','COMMENT','DELTAG0','EC-NUMBER',
                     'ENZYMATIC-REACTION','IN-PATHWAY','LEFT','OFFICIAL-EC?','ORPHAN?','RIGHT','SIGNAL','SPECIES',
                     'SPONTANEOUS?','SYNONYMS'],
    'regulation.dat':['UNIQUE-ID','TYPES','COMMON-NAME','ASSOCIATED-BINDING-SITE','COMMENT','MECHANISM','MODE',
                      'PHYSIOLOGICALLY-RELEVANT?','REGULATED-BY','REGULATED-ENTITY','REGULATOR','SYNONYMS'],
    'regulons.dat':['UNIQUE-ID','TYPES','COMMON-NAME','ACTIVATORS-ALLOSTERIC-OF','ACTIVATORS-NONALLOSTERIC-OF',
                    'ACTIVATORS-UNKMECH-OF','APPEARS-IN-BINDING-REACTIONS','AROMATIC-RINGS','ATOM-CHARGES',
                    'ATOM-CHIRALITY','CATALYZES','CHARGE','CHEMICAL-FORMULA','CITATIONS','COFACTORS-OF',
                    'COFACTORS-OR-PROSTHETIC-GROUPS-OF','COMMENT','COMPONENT-COEFFICIENTS','COMPONENT-OF','COMPONENTS',
                    'CREDITS','DATA-SOURCE','DBLINKS','DNA-FOOTPRINT-SIZE','FEATURES','FUNCTIONAL-ASSIGNMENT-COMMENT',
                    'FUNCTIONAL-ASSIGNMENT-STATUS','GENE','GO-TERMS','INHIBITORS-ALLOSTERIC-OF',
                    'INHIBITORS-COMPETITIVE-OF','INHIBITORS-IRREVERSIBLE-OF','INHIBITORS-NONCOMPETITIVE-OF',
                    'INHIBITORS-OTHER-OF','INHIBITORS-UNCOMPETITIVE-OF','INHIBITORS-UNKMECH-OF',
                    'INSTANCE-NAME-TEMPLATE','ISOZYME-SEQUENCE-SIMILARITY','LOCATIONS','MODIFIED-FORM',
                    'MOLECULAR-WEIGHT','MOLECULAR-WEIGHT-EXP','MOLECULAR-WEIGHT-KD','MOLECULAR-WEIGHT-SEQ','N+1-NAME',
                    'N-1-NAME','N-NAME','NEIDHARDT-SPOT-NUMBER','PI','PROSTHETIC-GROUPS-OF','REGULATED-BY','REGULATES',
                    'SPECIES','SPLICE-FORM-INTRONS','STRUCTURE-BONDS','SYMMETRY','SYNONYMS','UNMODIFIED-FORM'],
    'terminators.dat':['UNIQUE-ID','TYPES','COMMON-NAME','APPEARS-IN-BINDING-REACTIONS','CITATIONS','COMMENT',
                       'COMPONENT-OF','COMPONENTS','CREDITS','DATA-SOURCE','DBLINKS','INSTANCE-NAME-TEMPLATE',
                       'LEFT-END-POSITION','RIGHT-END-POSITION','SYNONYMS'],
    'transunits.dat':['UNIQUE-ID','TYPES','COMMON-NAME','APPEARS-IN-BINDING-REACTIONS','CITATIONS','COMMENT',
                      'COMPONENT-OF','COMPONENTS','CREDITS','DATA-SOURCE','DBLINKS','EXTENT-UNKNOWN?',
                      'INSTANCE-NAME-TEMPLATE','LEFT-END-POSITION','REGULATED-BY','RIGHT-END-POSITION','SYNONYMS']
}

In [13]:
with open('../models/benchmarks/pmn_relation_matching/pmn_dat_fields.json', 'w') as myf:
    json.dump(field_json, myf)

Please, thank me for my service.

Questions that arose during making this:
1. Is UNIQUE-ID always a gene name?

Ok well that was basically the one question I guess

Let's just read 'em in lads

In [30]:
data_files = []
for f in listdir(pathbase):
    if f in field_json.keys():
        print(f'\nReading in file {f}')
        try:
            df = read_dat(join(pathbase, f), field_json[f])
            data_files.append(df)
            print('Success!')
        except UnicodeDecodeError as e:
            print(f'Cannot read in {f}, error is:\n{e}')


Reading in file pubs.dat
Success!

Reading in file dnabindsites.dat
Success!

Reading in file pathways.dat
Success!

Reading in file classes.dat
Success!

Reading in file protein-features.dat
Success!

Reading in file regulation.dat
Success!

Reading in file compounds.dat
Success!

Reading in file reactions.dat
Success!

Reading in file terminators.dat
Success!

Reading in file genes.dat
Success!

Reading in file promoters.dat
Success!

Reading in file enzrxns.dat
Success!

Reading in file transunits.dat
Success!

Reading in file proteins.dat
Success!

Reading in file regulons.dat
Success!
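The loop only prints something for files whose names match a key, so a key that never matches anything in the directory is skipped without a word. A small sketch of a check for that; `unmatched_keys` is a hypothetical helper, and the toy inputs stand in for `field_json` and `listdir(pathbase)`.

```python
def unmatched_keys(field_map, filenames):
    """Return the keys of field_map that do not appear in filenames, sorted."""
    return sorted(set(field_map) - set(filenames))


# Toy inputs: bindrxns.dat has a field list but no file in the listing
toy_fields = {'genes.dat': [], 'bindrxns.dat': [], 'pubs.dat': []}
toy_listing = ['genes.dat', 'genes.col', 'pubs.dat']
print(unmatched_keys(toy_fields, toy_listing))
```

Running this against the real directory would show both files that simply are not shipped and keys that are spelled differently from the file on disk.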


Let's see if some of these files are encoded with a different character set, or if something is just broken:

In [18]:
!pip install charset-normalizer

Defaulting to user installation because normal site-packages is not writeable
Collecting charset-normalizer
  Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Installing collected packages: charset-normalizer
Successfully installed charset-normalizer-2.0.12


In [19]:
from charset_normalizer import detect

In [25]:
file_encodings = {}
for f in listdir(pathbase):
    if f in field_json.keys():
        with open(join(pathbase, f), 'rb') as myf:
            rawdata = myf.read()
        file_encodings[f] = detect(rawdata)

In [26]:
file_encodings

{'pubs.dat': {'encoding': 'windows-1250',
  'language': 'English',
  'confidence': 1.0},
 'dnabindsites.dat': {'encoding': 'ascii',
  'language': 'English',
  'confidence': 1.0},
 'pathways.dat': {'encoding': 'windows-1250',
  'language': 'English',
  'confidence': 0.9482},
 'classes.dat': {'encoding': 'windows-1250',
  'language': 'English',
  'confidence': 1.0},
 'protein-features.dat': {'encoding': 'ascii',
  'language': 'English',
  'confidence': 1.0},
 'regulation.dat': {'encoding': 'ascii',
  'language': 'English',
  'confidence': 1.0},
 'compounds.dat': {'encoding': 'windows-1250',
  'language': 'English',
  'confidence': 0.9386},
 'reactions.dat': {'encoding': 'windows-1250',
  'language': 'Indonesian',
  'confidence': 0.9104},
 'terminators.dat': {'encoding': 'ascii',
  'language': 'English',
  'confidence': 1.0},
 'genes.dat': {'encoding': 'ascii',
  'language': 'English',
  'confidence': 0.9806},
 'promoters.dat': {'encoding': 'ascii',
  'language': 'English',
  'confidence'

I tried reading every file in with both `ascii` and `windows-1250` to test Galen's theory: that some of the files are technically encoded differently but just happen not to trip errors when read as utf-8. And `windows-1250` works!
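A stdlib-only sketch of why one encoding can cover both groups: bytes in the ASCII range decode the same way under `windows-1250`, so files detected as `ascii` read fine with it too. The helper `decode_with_fallback` and the byte strings are made up for illustration; the fallback order is my choice, not something the files dictate.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'windows-1250')):
    """Try each encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


# Pure-ASCII bytes decode identically under both encodings
ascii_bytes = b'UNIQUE-ID - GENE-1'
same_text = ascii_bytes.decode('ascii') == ascii_bytes.decode('windows-1250')

# 0xE9 on its own is not valid UTF-8, but it is a letter in windows-1250
text, used = decode_with_fallback(b'caf\xe9')
print(same_text, repr(text), used)
```

Passing `encoding='windows-1250'` to every `open` call, as `read_dat` does, is the simpler option as long as none of the files are actually UTF-8 with multi-byte characters, in which case they would decode without error but come out garbled.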