# Extract RXCUI relationships from the `RXNREL.RRF` file

2019-06-03

The RxNorm semantic network is contained in the `RXNREL.RRF` file.
We will extract the RXCUI relationships so that we can traverse the network in order to determine the active ingredients of each drug.

In [1]:
import pandas as pd

## Read `RXNREL.RRF`

In [2]:
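rawd = (pd
    .read_csv(
        # RRF files are pipe-delimited and have no header row.
        "../../data/rxnorm/RXNREL.RRF", sep="|", names=[
            "rxcui1", "rxaui1", "stype1", "rel",
            "rxcui2", "rxaui2", "stype2", "rela",
            "rui", "srui", "sab", "sl", "dir",
            "rg", "suppress", "cvf",
            # Placeholder for the empty field after each line's trailing "|".
            "temp"
        ]
    )
    # Drop columns that are entirely empty, including "temp".
    .dropna(how="all", axis=1)
)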
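To illustrate why an extra placeholder name such as `temp` is useful, here is a minimal sketch that parses two made-up pipe-delimited lines from an in-memory buffer. The sample rows and the shortened column list are invented for the example and are not taken from `RXNREL.RRF`; the only assumption about the real file is that every field, including the last one, is followed by `|`.

```python
import io

import pandas as pd

# Two made-up rows in RRF style: each line ends with a trailing "|".
sample = io.StringIO("38|CUI|RB|1760|\n38|CUI|RO|105050|\n")

toy = (pd
    .read_csv(
        sample, sep="|",
        # "temp" (hypothetical name) absorbs the empty field after the final "|".
        names=["rxcui1", "stype1", "rel", "rxcui2", "temp"]
    )
    # The placeholder column is entirely null, so it is dropped here.
    .dropna(how="all", axis=1)
)

print(toy.columns.tolist())
```

After the `dropna` step only the four named data columns remain.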

In [3]:
rawd.shape

(6389185, 12)

In [4]:
rawd.head()

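| | rxcui1 | rxaui1 | stype1 | rel | rxcui2 | rxaui2 | stype2 | rela | rui | sab | dir | cvf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 5.0 | AUI | SY | | 6.0 | AUI | permuted_term_of | 140022271.0 | MSH | | |
| 1 | | 5.0 | SDUI | SIB | | 104746.0 | SDUI | | 140773282.0 | MSH | | |
| 2 | | 5.0 | SDUI | RN | | 609702.0 | SDUI | mapped_to | 139927355.0 | MSH | 1.0 | |
| 3 | | 5.0 | AUI | SY | | 2666961.0 | AUI | sort_version_of | 140943001.0 | MSH | | |
| 4 | | 5.0 | AUI | SY | | 2681015.0 | AUI | entry_version_of | 140018071.0 | MSH | | |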


### Empty cells

In [5]:
(rawd
    .isnull()
    .sum()
    .to_frame("num_null")
    .assign(pct_null = lambda df: df["num_null"].divide(len(rawd)).multiply(100))
)

| | num_null | pct_null |
|---|---|---|
| rxcui1 | 4897503 | 76.653016 |
| rxaui1 | 1491682 | 23.346984 |
| stype1 | 0 | 0.0 |
| rel | 0 | 0.0 |
| rxcui2 | 4897503 | 76.653016 |
| rxaui2 | 1491682 | 23.346984 |
| stype2 | 0 | 0.0 |
| rela | 524493 | 8.209075 |
| rui | 500317 | 7.830686 |
| sab | 0 | 0.0 |


The RXCUI columns are null in about 77% of rows, compared with about 23% for the RXAUI columns.
There seem to be more relationships between RXAUIs than between RXCUIs.

### Column data types

In [6]:
rawd.dtypes

rxcui1    float64
rxaui1    float64
stype1     object
rel        object
rxcui2    float64
rxaui2    float64
stype2     object
rela       object
rui       float64
sab        object
dir       float64
cvf       float64
dtype: object

pandas reads integer columns that contain nulls as `float64`, so we will need to convert the ID columns back to ints.
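A minimal sketch of two ways to get integers back, using a made-up column rather than the real data: remove the nulls first and cast to `int64`, or keep the nulls by using pandas' nullable `Int64` extension dtype. The values below are illustrative only.

```python
import pandas as pd

# Made-up ID column; the missing value forces float64.
toy = pd.DataFrame({"rxcui1": [38.0, None, 1760.0]})
assert toy["rxcui1"].dtype == "float64"

# Option 1: drop the nulls, then cast to a plain NumPy integer dtype.
as_int = toy["rxcui1"].dropna().astype("int64")

# Option 2: keep the nulls with the nullable integer extension dtype.
as_nullable = toy["rxcui1"].astype("Int64")

print(as_int.tolist())
print(as_nullable.isna().sum())
```

The first option fits the case where rows with nulls are filtered out anyway; the second is an alternative when the null rows need to be kept.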

### Concept types

In [7]:
rawd["stype1"].value_counts()

AUI     4870937
CUI     1491682
SDUI      19300
SCUI       7266
Name: stype1, dtype: int64

In [8]:
rawd["stype2"].value_counts()

AUI     4870937
CUI     1491682
SDUI      19300
SCUI       7266
Name: stype2, dtype: int64

### Can we use the stype columns to only get the RXCUI relationships?

Check whether filtering on the stype columns would miss any RXCUI relationships.

In [9]:
(rawd
    [rawd["rxcui1"].isnull()]
    ["stype1"].value_counts()
)

AUI     4870937
SDUI      19300
SCUI       7266
Name: stype1, dtype: int64

In [10]:
(rawd
    [rawd["rxcui2"].isnull()]
    ["stype2"].value_counts()
)

AUI     4870937
SDUI      19300
SCUI       7266
Name: stype2, dtype: int64

Rows with a null RXCUI never have "CUI" in the stype column.

In [11]:
(rawd
    [rawd["rxcui1"].notnull()]
    ["stype1"].value_counts()
)

CUI    1491682
Name: stype1, dtype: int64

In [12]:
(rawd
    [rawd["rxcui2"].notnull()]
    ["stype2"].value_counts()
)

CUI    1491682
Name: stype2, dtype: int64

Rows with a non-null RXCUI always have "CUI" in the stype column, so filtering on the stype columns selects exactly the RXCUI relationships.
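The same check can be expressed as a single boolean test per side: a row should have a non-null RXCUI if and only if its stype is "CUI". The sketch below runs on a made-up frame; the helper name `stype_matches_rxcui` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Made-up rows that mimic the layout of the raw table.
toy = pd.DataFrame({
    "rxcui1": [38.0, None, None],
    "stype1": ["CUI", "AUI", "SDUI"],
    "rxcui2": [1760.0, None, None],
    "stype2": ["CUI", "AUI", "SDUI"],
})

def stype_matches_rxcui(df, side):
    """Return True if rxcui{side} is non-null exactly when stype{side} is 'CUI'."""
    has_rxcui = df[f"rxcui{side}"].notnull()
    is_cui = df[f"stype{side}"].eq("CUI")
    return bool((has_rxcui == is_cui).all())

print(stype_matches_rxcui(toy, 1), stype_matches_rxcui(toy, 2))
```

If either call returned `False`, filtering on the stype columns would either miss RXCUI relationships or include rows without RXCUIs.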

---

## Use the stype columns to filter down to RXCUI relationships

In [13]:
cui_rels = (rawd
    .query("stype1 == 'CUI' and stype2 == 'CUI'")
    .drop(["stype1", "stype2"], axis=1)
    .dropna(how="all", axis=1)
    .assign(
        rxcui1 = lambda df: df["rxcui1"].astype("int64"),
        rxcui2 = lambda df: df["rxcui2"].astype("int64"),
        rui = lambda df: df["rui"].astype("int64")
    )
    .reset_index(drop=True)
)

In [14]:
cui_rels.shape

(1491682, 7)

In [15]:
cui_rels.head()

| | rxcui1 | rel | rxcui2 | rela | rui | sab | cvf |
|---|---|---|---|---|---|---|---|
| 0 | 38 | RB | 1760 | has_tradename | 4696871 | RXNORM | 4096.0 |
| 1 | 38 | RO | 105050 | has_ingredient | 4343918 | RXNORM | 4096.0 |
| 2 | 38 | RO | 105445 | has_ingredient | 4229336 | RXNORM | |
| 3 | 38 | RO | 105446 | has_ingredient | 3798489 | RXNORM | 4096.0 |
| 4 | 38 | RO | 105447 | has_ingredient | 4423580 | RXNORM | |


In [16]:
cui_rels.dtypes

rxcui1      int64
rel        object
rxcui2      int64
rela       object
rui         int64
sab        object
cvf       float64
dtype: object

In [17]:
cui_rels.isnull().sum()

rxcui1         0
rel            0
rxcui2         0
rela           0
rui            0
sab            0
cvf       968420
dtype: int64

In [18]:
cui_rels["sab"].value_counts()

RXNORM    1491682
Name: sab, dtype: int64

## Create final version

The source abbreviation (`sab`) column is not useful because every value is `RXNORM`.
The `rui` and `cvf` columns are not needed for our purposes either.

In [19]:
final = cui_rels.drop(["rui", "sab", "cvf"], axis=1)
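Columns that hold a single value, like `sab` here, can also be found programmatically instead of by inspecting `value_counts()` one column at a time. This is a small sketch on made-up data, not a step from the original notebook.

```python
import pandas as pd

# Made-up rows; "sab" has the same value everywhere.
toy = pd.DataFrame({
    "rxcui1": [38, 1760],
    "rel": ["RB", "RN"],
    "sab": ["RXNORM", "RXNORM"],
})

# A column is constant when it has exactly one distinct value (nulls included).
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

trimmed = toy.drop(constant_cols, axis=1)
print(constant_cols)
```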

In [20]:
final.shape

(1491682, 4)

In [21]:
final.head()

| | rxcui1 | rel | rxcui2 | rela |
|---|---|---|---|---|
| 0 | 38 | RB | 1760 | has_tradename |
| 1 | 38 | RO | 105050 | has_ingredient |
| 2 | 38 | RO | 105445 | has_ingredient |
| 3 | 38 | RO | 105446 | has_ingredient |
| 4 | 38 | RO | 105447 | has_ingredient |


### Relationship counts

In [22]:
final["rel"].value_counts()

RO    838920
RB    326381
RN    326381
Name: rel, dtype: int64

In [23]:
(final["rela"]
    .value_counts()
    .to_frame("num_rels")
    .reset_index()
    .rename(columns={"index": "rela"})
    .sort_values(["num_rels", "rela"], ascending=[False, True])
    .reset_index(drop=True)
)

| | rela | num_rels |
|---|---|---|
| 0 | inverse_isa | 202697 |
| 1 | isa | 202697 |
| 2 | has_ingredient | 155284 |
| 3 | ingredient_of | 155284 |
| 4 | consists_of | 110968 |
| 5 | constitutes | 110968 |
| 6 | has_tradename | 106417 |
| 7 | tradename_of | 106417 |
| 8 | dose_form_of | 89899 |
| 9 | has_dose_form | 89899 |


## Only one edge per unique node pair?

In [24]:
final.groupby(["rxcui1", "rxcui2"]).size().value_counts()

1    1491682
dtype: int64

Yes, each ordered (`rxcui1`, `rxcui2`) pair appears in exactly one row.
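Note that the `groupby` treats (a, b) and (b, a) as different pairs. The sketch below, on made-up rows, shows the difference between counting duplicates over ordered pairs and over unordered pairs; the `lo`/`hi` column names are introduced only for this example.

```python
import pandas as pd

# Made-up edges: the first two rows are the same pair in opposite directions.
toy = pd.DataFrame({
    "rxcui1": [38, 1760, 38],
    "rxcui2": [1760, 38, 105050],
})

# Ordered pairs: (38, 1760) and (1760, 38) count as different edges.
n_dup_ordered = int(toy.duplicated(subset=["rxcui1", "rxcui2"]).sum())

# Unordered pairs: sort each pair so that direction is ignored.
lo = toy[["rxcui1", "rxcui2"]].min(axis=1)
hi = toy[["rxcui1", "rxcui2"]].max(axis=1)
n_dup_unordered = int(pd.DataFrame({"lo": lo, "hi": hi}).duplicated().sum())

print(n_dup_ordered, n_dup_unordered)
```

On this toy frame there are no duplicate ordered pairs, but one unordered pair occurs twice.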

## Relationships are reciprocal?

In [25]:
set(final["rxcui1"]) == set(final["rxcui2"])

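True

The same set of RXCUIs appears in both columns, which is consistent with every relationship having an inverse.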
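Comparing the two sets of RXCUIs is a necessary condition for reciprocity but not a sufficient one, since it says nothing about individual edges. A stricter check, sketched here on made-up rows, looks for each edge (a, b) among the reversed edges (b, a) with a left merge and pandas' `indicator` option.

```python
import pandas as pd

# Made-up edges: 38 -> 105050 has no reverse edge in this toy frame.
toy = pd.DataFrame({
    "rxcui1": [38, 1760, 38],
    "rxcui2": [1760, 38, 105050],
    "rela": ["has_tradename", "tradename_of", "has_ingredient"],
})

forward = toy[["rxcui1", "rxcui2"]]
# Swap the two column names so each row represents the reversed edge.
reverse = forward.rename(columns={"rxcui1": "rxcui2", "rxcui2": "rxcui1"})

checked = forward.merge(
    reverse, on=["rxcui1", "rxcui2"], how="left", indicator=True
)
# Rows marked "left_only" are edges whose reverse is absent.
missing_reverse = checked[checked["_merge"] == "left_only"]

print(len(missing_reverse))
```

An empty `missing_reverse` on the real table would confirm that every edge has a counterpart in the opposite direction.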

## Save to file

In [26]:
final.to_csv("../../pipeline/rxnorm/rxcui_rels.tsv", sep='\t', index=False)