@hayesall hayesall commented Nov 30, 2021

This modifies the DDI data on the starling page so that it passes the linting checks.

Here's a sample of the original facts:

Target("3-hydroxy-3-methylglutaryl-coenzyme_A_reductase","Pravastatin").
Target("Gamma-aminobutyric_acid_type_B_receptor_subunit_1","Baclofen").
Target("Gamma-aminobutyric_acid_type_B_receptor_subunit_2","Baclofen").
Target("Synaptic_vesicular_amine_transporter","Amphetamine").
Target("Sodium-dependent_dopamine_transporter","Amphetamine").
Target("Cocaine-_and_amphetamine-regulated_transcript_protein","Amphetamine").
Target("Trace_amine-associated_receptor_1","Amphetamine").

and here is what it looks like following the changes:

target(3_hydroxy_3_methylglutaryl_coenzyme_a_reductase,pravastatin).
target(gamma_aminobutyric_acid_type_b_receptor_subunit_1,baclofen).
target(gamma_aminobutyric_acid_type_b_receptor_subunit_2,baclofen).
target(synaptic_vesicular_amine_transporter,amphetamine).
target(sodium_dependent_dopamine_transporter,amphetamine).
target(cocaine__and_amphetamine_regulated_transcript_protein,amphetamine).
target(trace_amine_associated_receptor_1,amphetamine).

This was done by lowercasing the predicate name and, in each argument, replacing - with an underscore _, removing " and /, and converting everything to lowercase:

corrected = [name.replace("-", "_").replace('"', "").replace("/", "").lower() for name in names]
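As a quick illustration, the one-liner above can be applied to the arguments of one of the sample facts. This is a minimal sketch: the helper name `normalize_name` is introduced here for readability and is not part of the original script.

```python
def normalize_name(name: str) -> str:
    """Replace hyphens with underscores, drop quotes and slashes, then lowercase."""
    return name.replace("-", "_").replace('"', "").replace("/", "").lower()


# Arguments taken from one of the original facts shown earlier.
names = ['"Cocaine-_and_amphetamine-regulated_transcript_protein"', '"Amphetamine"']
corrected = [normalize_name(name) for name in names]

print(f"target({','.join(corrected)}).")
```

Note that the `-_` sequence in the original name becomes a double underscore, which matches the cleaned fact listed above.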

As a precaution, I recorded each "corrected" name in a dictionary that maps it to the original names it came from, and the script stops with an error if the same corrected name is produced from two different original names:

{
 "3_hydroxy_3_methylglutaryl_coenzyme_a_reductase": [
  "\"3-hydroxy-3-methylglutaryl-coenzyme_A_reductase\""
 ],
 "pravastatin": [
  "\"Pravastatin\""
 ],
 "gamma_aminobutyric_acid_type_b_receptor_subunit_1": [
  "\"Gamma-aminobutyric_acid_type_B_receptor_subunit_1\""
 ],
 "baclofen": [
  "\"Baclofen\""
 ]
}

I was worried there might be cases like Warfarin-other and Warfarin_other that would get mapped into the same bucket, but this did not occur, so the structures should be equivalent up to renaming.
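The collision check can be sketched on its own as follows. The names in `sample` are hypothetical and chosen only to trigger a collision; no such pair was found in the real data. The helpers `normalize_name` and `find_collisions` are illustrative names, not functions from the original script.

```python
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Same cleaning rule as the script: hyphen to underscore, drop quotes and slashes, lowercase."""
    return name.replace("-", "_").replace('"', "").replace("/", "").lower()


def find_collisions(names):
    """Return every corrected name that was produced from more than one original name."""
    mapping = defaultdict(set)
    for old in names:
        mapping[normalize_name(old)].add(old)
    return {new: olds for new, olds in mapping.items() if len(olds) > 1}


# Hypothetical input: two distinct originals that clean to the same name.
sample = ['"Warfarin-other"', '"Warfarin_other"', '"Baclofen"']
collisions = find_collisions(sample)
print(collisions)
```

An empty result means the renaming is one-to-one, which is what makes the cleaned facts equivalent to the originals up to renaming.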


The code I used to do this is copied below, but it isn't interesting enough to commit to the repository:

Python script to clean DDI data to pass linter
from collections import defaultdict


def load_file(filename):
    with open(filename, "r") as fh:
        return fh.read().splitlines()

def split_into_parts(input_line):
    head, tail = input_line.split("(")
    first, _ = tail.split(")")
    names = first.split(",")

    correct_head = head.lower()
    corrected = [name.replace("-", "_").replace('"', "").replace("/", "").lower() for name in names]

    return names, correct_head, corrected

if __name__ == "__main__":

    mapping = defaultdict(set)
    output = []

    for line in load_file("drug_interactions/train/train_facts.txt"):

        values, correct_head, corrected = split_into_parts(line)

        # Assert that a "new" key doesn't map to two "old" keys.
        for old, new in zip(values, corrected):
            mapping[new].add(old)

            # Stop if two different original names collapse to the same new name.
            if len(mapping[new]) > 1:
                print("Encountered duplicate")
                print(mapping[new])
                raise SystemExit(2)

        result = f"{correct_head}({','.join(corrected)})."
        output.append(result)

    with open("../ddi2/ddi2/train/train_facts.txt", "w") as fh:
        for line in output:
            fh.write(line + "\n")
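A cleaned file could also be spot-checked before running the linter. The sketch below assumes the linter expects a lowercase predicate followed by unquoted, comma-separated arguments made of lowercase letters, digits, and underscores; that pattern is inferred from the sample output above rather than taken from the linter itself, and `is_clean` is a hypothetical helper.

```python
import re

# Assumed shape of a cleaned fact: predicate(arg1,arg2,...).
# Arguments may start with a digit, as in 3_hydroxy_3_methylglutaryl_coenzyme_a_reductase.
FACT_PATTERN = re.compile(r"[a-z][a-z0-9_]*\([a-z0-9_]+(,[a-z0-9_]+)*\)\.")


def is_clean(line: str) -> bool:
    """Return True if the whole line matches the assumed cleaned-fact pattern."""
    return FACT_PATTERN.fullmatch(line) is not None


for fact in [
    'Target("Trace_amine-associated_receptor_1","Amphetamine").',
    "target(trace_amine_associated_receptor_1,amphetamine).",
]:
    print(is_clean(fact), fact)
```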

@hayesall hayesall merged commit e6e41ec into main Nov 30, 2021
@hayesall hayesall deleted the ddi branch November 30, 2021 21:00