# Vocabulary Generator from SNMI

In [1]:
# Install requirements
! pip install -r requirements.txt



In [2]:
# importing libraries
import os
import pandas as pd

## Load CSV from Bioportal

Download the latest CSV export of SNMI from [BioPortal](https://bioportal.bioontology.org/ontologies/SNMI) and save it as `SNMI.csv` next to this notebook.

In [3]:
snmi = pd.read_csv("SNMI.csv")

In [4]:
# Looking at the data
snmi.head(10)

Unnamed: 0,Class ID,Preferred Label,Synonyms,Definitions,Obsolete,CUI,Semantic Types,Parents,Associated with,EZ,Has location,Location of,SB,Semantic type UMLS property,SHF,SIC,SMX
0,http://purl.bioontology.org/ontology/SNMI/C-140A7,235m Uranium,,,False,C0303352,http://purl.bioontology.org/ontology/STY/T196|...,http://purl.bioontology.org/ontology/SNMI/C-14000,,,,,,http://purl.bioontology.org/ontology/STY/T196|...,,,
1,http://purl.bioontology.org/ontology/SNMI/T-A1950,Choroid plexus of fourth ventricle,,,False,C0152293,http://purl.bioontology.org/ontology/STY/T023,http://purl.bioontology.org/ontology/SNMI/T-A1900,,,,,,http://purl.bioontology.org/ontology/STY/T023,,C71.7,
2,http://purl.bioontology.org/ontology/SNMI/L-61420,"Rasahus, NOS",,,False,C0322617,http://purl.bioontology.org/ontology/STY/T204,http://purl.bioontology.org/ontology/SNMI/L-61300,,,,,,http://purl.bioontology.org/ontology/STY/T204,,,
3,http://purl.bioontology.org/ontology/SNMI/DC-1...,HNSHA due to gamma glutamyl cysteine synthetas...,,,False,C0272071,http://purl.bioontology.org/ontology/STY/T047,http://purl.bioontology.org/ontology/SNMI/DC-1...,,,,,,http://purl.bioontology.org/ontology/STY/T047,,,
4,http://purl.bioontology.org/ontology/SNMI/P1-A...,Repair of perforating laceration of sclera wit...,,,False,C0197629,http://purl.bioontology.org/ontology/STY/T061,http://purl.bioontology.org/ontology/SNMI/P1-A...,,,,,,http://purl.bioontology.org/ontology/STY/T061,,,
5,http://purl.bioontology.org/ontology/SNMI/F-671B0,"beta-Carotene 15,15'-dioxygenase",,,False,C0053397,http://purl.bioontology.org/ontology/STY/T126|...,http://purl.bioontology.org/ontology/SNMI/F-66100,,1.13.11.21,,,,http://purl.bioontology.org/ontology/STY/T126|...,,,
6,http://purl.bioontology.org/ontology/SNMI/C-D4163,MITABAN,,,False,C0702074,http://purl.bioontology.org/ontology/STY/T109|...,http://purl.bioontology.org/ontology/SNMI/C-D0000,,,,,V,http://purl.bioontology.org/ontology/STY/T109|...,,,(X-10186)
7,http://purl.bioontology.org/ontology/SNMI/J-43230,Manufacturers' agent,,,False,C0335279,http://purl.bioontology.org/ontology/STY/T097,http://purl.bioontology.org/ontology/SNMI/J-43200,,,,,,http://purl.bioontology.org/ontology/STY/T097,,,
8,http://purl.bioontology.org/ontology/SNMI/C-D2165,CLOSTRIDIAL 7-WAY PLUS SOMNUMUNE,,,False,C0308688,http://purl.bioontology.org/ontology/STY/T116|...,http://purl.bioontology.org/ontology/SNMI/C-D0000,,,,,V,http://purl.bioontology.org/ontology/STY/T116|...,,,(X-20153)
9,http://purl.bioontology.org/ontology/SNMI/F-662E8,2-Hydroxy-3-oxopropionate reductase,Tartronate semialdehyde reductase,,False,C0311493,http://purl.bioontology.org/ontology/STY/T126|...,http://purl.bioontology.org/ontology/SNMI/F-66100,,1.1.1.60,,,,http://purl.bioontology.org/ontology/STY/T126|...,,,


Looking at the data, we only need the `Preferred Label` and the `Synonyms` columns to generate a vocabulary text file.

In [5]:
features = ['Preferred Label','Synonyms']
df = snmi[features]

In [6]:
df.shape

(109150, 2)

In [7]:
# Stack the Synonyms column under the Preferred Label column
# (Series.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat([df['Preferred Label'], df['Synonyms']])

In [8]:
df.shape

(218300,)
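
The jump from 109150 rows to 218300 entries comes from placing one column underneath the other. A minimal sketch of that step with `pd.concat`, on a toy frame whose values are made up for illustration (only the two column names match the notebook):

```python
import pandas as pd

# Toy frame with made-up values; only the column names match SNMI.csv
toy = pd.DataFrame({
    "Preferred Label": ["Otoliths", "Uranium"],
    "Synonyms": ["Ear stones", None],
})

# Stack the Synonyms column underneath the Preferred Label column
combined = pd.concat([toy["Preferred Label"], toy["Synonyms"]])

print(combined.shape)              # twice the number of rows in the frame
print(int(combined.isna().sum()))  # missing synonyms are carried along
```

The combined Series keeps the original row labels, so each index value appears twice, and any missing synonym is still present as a missing entry.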

Now we need to remove the missing (NaN) values, since many rows have no synonyms.

In [9]:
cleaned_df = df.dropna()
cleaned_df

0                                              235m Uranium
1                        Choroid plexus of fourth ventricle
2                                              Rasahus, NOS
3         HNSHA due to gamma glutamyl cysteine synthetas...
4         Repair of perforating laceration of sclera wit...
                                ...                        
109131     Metastatic malignant neoplasm to pituitary gland
109137        Metastatic malignant neoplasm to skin of knee
109140                       Anaemia due to zinc deficiency
109142                                             Otoliths
109144    Lymphocytic inflammatory cell infiltrate, NOS|...
Length: 141274, dtype: object

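We can use the `\W` regular expression (regex) to split on non-word characters (whitespace, punctuation, symbols, etc.) and collect every unique word for our vocabulary. Note that `\W` does not match the underscore, and that two separators in a row (such as a comma followed by a space) leave an empty string behind.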

In [10]:
# A raw string keeps Python from treating "\W" as an invalid escape sequence
vocab = cleaned_df.str.split(r"\W", expand=True).stack().unique()
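
To see what splitting on `\W` does to individual labels, here is a small sketch that uses Python's standard `re` module instead of pandas, so the pattern can be inspected in isolation. The labels are toy strings chosen for illustration, and the names `labels`, `tokens` and `unique_tokens` are not part of the notebook.

```python
import re

# Toy labels for illustration (not read from SNMI.csv)
labels = ["Rasahus, NOS", "beta-Carotene 15,15'-dioxygenase", "Otoliths|Ear stones"]

tokens = []
for label in labels:
    # \W matches one non-word character at a time, so ", " yields an empty token
    tokens.extend(re.split(r"\W", label))

# dict.fromkeys removes duplicates while keeping first-seen order
unique_tokens = list(dict.fromkeys(tokens))
print(unique_tokens)
```

The result contains an empty string wherever two separators were adjacent, which is why a filtering step is still needed before the vocabulary is written out.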

In [11]:
# sanity check
vocab

array(['235m', 'Uranium', 'Choroid', ..., 'Ureteroplasty',
       'FLOCCULONODULAR', 'Otoliths'], dtype=object)

In [12]:
# Drop the empty strings left over from the split;
# list() materializes the result so it can be reused
final_vocab = list(filter(None, vocab))
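
One detail worth knowing about this step: `filter(None, ...)` keeps only truthy items, and in Python 3 it returns a one-shot iterator rather than a list. A short sketch on toy tokens (made up for illustration) shows the effect:

```python
# Toy tokens, including the empty strings that splitting on \W can leave behind
tokens = ["Rasahus", "", "NOS", ""]

lazy = filter(None, tokens)   # iterator that skips falsy items such as ""
first_pass = list(lazy)       # consumes the iterator
second_pass = list(lazy)      # the iterator is exhausted, so nothing is left

print(first_pass)
print(second_pass)
```

If the vocabulary is needed more than once (for example to count it and then write it), storing it in a list avoids an unexpectedly empty second pass.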

Now we can save the vocabulary to a plain-text (*.txt) file, one word per line, for use in Natural Language Processing (NLP).

In [13]:
filepath = "../vocab.txt"
with open(filepath, 'w', encoding='utf-8') as file_handler:
    for item in final_vocab:
        file_handler.write("{}\n".format(item))
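
As a quick check that the one-word-per-line format round-trips, the sketch below writes a few words to a temporary file and reads them back. The word list and the temporary path are stand-ins for illustration; they are not the notebook's `final_vocab` or `../vocab.txt`.

```python
import os
import tempfile

# Hypothetical stand-in for final_vocab
words = ["235m", "Uranium", "Choroid"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")  # illustrative path only

    # Write one word per line, as the notebook does
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write("{}\n".format(word))

    # Read the file back and strip the trailing newline from each line
    with open(path, encoding="utf-8") as handle:
        loaded = [line.rstrip("\n") for line in handle]

print(loaded == words)
```

Passing an explicit `encoding` on both the write and the read keeps the result independent of the platform's default encoding.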