## Fine Tuning for Medical Coding  
#### Part 1: Data Preparation  

---

**Goal for this Notebook**
- Prepare a dataset to fine tune a model for L1 (Chapter) level classification. The reason for fine-tuning at a 'higher' level is to eliminate challenges assocaited with the long-tail problem. This exercise will fine tune a model for multi-label classification with 17 label options.
  
<small>[_Click here for a complete list of ICD9 Chapters_](https://en.wikipedia.org/wiki/List_of_ICD-9_codes)</small>

**Approach**
  
The dataset will be created using the ICD9 code tree to create descriptions for chapter classifications. For example, for a single chapter we can create many rows for training by using the ICD code name, child code names, and supplement information from UMLS.
<small>
```markdown
--> Chapter name / description  
    --> UMLS concept atoms  
    --> UMLS concept definitions  
    --> Children, grandchildren, etc node names / descriptions  
        --> UMLS concept atoms  
        --> UMLS concept definitions  
```
</small>  

**Data**

The dataset will need to be formatted in json format as follows:
<small>
```json
{"messages": [{"role": "system", "content": "<SYSTEM MSG>"}, {"role": "user", "content": "<PROMPT>"}, {"role": "assistant", "content": "<CODE>"}]}
{"messages": [{"role": "system", "content": "<SYSTEM MSG>"}, {"role": "user", "content": "<PROMPT>"}, {"role": "assistant", "content": "<CODE>"}]}
{"messages": [{"role": "system", "content": "<SYSTEM MSG>"}, {"role": "user", "content": "<PROMPT>"}, {"role": "assistant", "content": "<CODE>"}]}
```
</small>

In [166]:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
from src.icd9_tree import ICD9
from dotenv import load_dotenv, find_dotenv
from textwrap import dedent

import pandas as pd
import numpy as np
import random
import requests
import os
import re

load_dotenv(find_dotenv(), override=True)
pd.set_option('display.max_colwidth', None)

In [2]:
# Authenticate to Client
# Authenticate the client using your key and endpoint 
key = os.getenv("LANGUAGE_KEY")
endpoint = os.getenv("LANGUAGE_ENDPOINT")

ta_credential = AzureKeyCredential(key)
client = TextAnalyticsClient(
        endpoint=endpoint, 
        credential=ta_credential)

---
#### Setup Code Tree

In [3]:
# Read ICD9 codes in as a tree and view top level. These 'Chapter' codes will be the labels for our fine tuned model.

tree = ICD9('icd9_codes_full.json')
chapter_codes = []
# list of top level codes (e.g., '001-139', ...)
toplevelnodes = tree.children
for node in toplevelnodes:
    if node.code[0] not in ['E', 'V']:
        print(node.code, node.description)
        chapter_codes.append(node.description)

001-139 INFECTIOUS AND PARASITIC DISEASES 
140-239 NEOPLASMS 
240-279 ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS 
280-289 DISEASES OF THE BLOOD AND BLOOD-FORMING ORGANS 
290-319 MENTAL DISORDERS 
320-389 DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS 
390-459 DISEASES OF THE CIRCULATORY SYSTEM 
460-519 DISEASES OF THE RESPIRATORY SYSTEM 
520-579 DISEASES OF THE DIGESTIVE SYSTEM 
580-629 DISEASES OF THE GENITOURINARY SYSTEM 
630-679 COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM 
680-709 DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE 
710-739 DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CONNECTIVE TISSUE 
740-759 CONGENITAL ANOMALIES 
760-779 CERTAIN CONDITIONS ORIGINATING IN THE PERINATAL PERIOD 
780-799 SYMPTOMS, SIGNS, AND ILL-DEFINED CONDITIONS 
800-999 INJURY AND POISONING 


---
#### Establish Helper Functions

In [8]:
# Get code from description
def get_chapter_code(description):
    code = None
    for node in tree.children:
        if node.description.strip() == description.strip():
            code = node.code
            break
        else:
            for child in node.children:
                if child.description.strip() == description.strip():
                    code = child.code
                    break
    return code

print(get_chapter_code('COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM'))

630-679


In [10]:
# Function to get the UMLS CUID(s) for a given text
# This function uses Azure Text Analytics for Health

def get_umls_concepts(client, documents):
    umls_concepts = []
    poller = client.begin_analyze_healthcare_entities(documents)
    result = poller.result()

    docs = [doc for doc in result if not doc.is_error]

    for idx, doc in enumerate(docs):
        for entity in doc.entities:
            if entity.data_sources and entity.category in ['SymptonOrSign', 'Diagnosis']:
                for data_source in entity.data_sources:
                    if data_source.name == "UMLS":
                        umls_concepts.append(data_source.entity_id)

    return umls_concepts

In [11]:
# Function to get the UMLS atoms from a cuid

def get_umls_atoms(cuid):
    synonyms = []
    sabs = ['ICD10', 'ICD10CM', 'ICD9CM', 'SNOMEDCT_US', 'MDR']      
    atom_uri = f"https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/{cuid}/atoms"
    page = 0  
    try:   
        while True:
            page += 1
            atom_query = {'apiKey':os.getenv("UMLS_API_KEY"), 'pageNumber':page, 'language':'ENG', 'sabs': ','.join(sabs)}
            a = requests.get(atom_uri, params=atom_query)
            a.encoding = 'utf-8'
            
            if a.status_code != 200:
                break

            all_atoms = a.json()
        
            for atom in all_atoms['result']:
                synonyms.append(re.sub("[\(\[].*?[\)\]]", "", atom['name']).lower().rstrip())
                #print(f'{atom}')

            return list(set(synonyms))
            
    except Exception as except_error:
        print(except_error)
        return

In [12]:
# Function to get UMLS definition list from a cuid

def umls_define(cuid):    
    definitions = []
    umls_uri = f"https://uts-ws.nlm.nih.gov/rest/content/current/CUI/{cuid}/definitions"
    root_sources = ['CSP','NCI','MSH','PDQ', 'MTH', 'HPO', 'DXP', 'SNMI', 'SNOMEDCT_US', 'ICD10CM', 'ICD10', 'ICD9CM', 'MDR']  
    page = 0  
    try:   
        while True:
            page += 1
            query = {'apiKey':os.getenv("UMLS_API_KEY"), 'pageNumber':page}
            a = requests.get(umls_uri, params=query)
            a.encoding = 'utf-8'
            
            if a.status_code != 200:
                break
            result = a.json()
        
            for value in result['result']:
                if value['rootSource'] in root_sources:
                    definitions.append(value['value'].lower().rstrip())

            return list(set(definitions))
            
    except Exception as except_error:
        print(except_error)
        return

In [13]:
# generate pd dataset

def generate_dataset(description, chapter, az_ta_cli, dataset_list):
    dataset_list.append({'description': description, 'chapter': chapter})

    umls_concepts = get_umls_concepts(az_ta_cli, [description])
    for cuid in umls_concepts:
        
        atoms = get_umls_atoms(cuid)
        if atoms:
            for atom in atoms:
                dataset_list.append({'description': atom, 'chapter': chapter})

        definitions = umls_define(cuid)
        if definitions:
            for definition in definitions:
                dataset_list.append({'description': definition, 'chapter': chapter})
    return

# Test
test_list = []
generate_dataset('COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM', '001-139', client, test_list)
print(test_list)

[{'description': 'COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM', 'chapter': '001-139'}, {'description': 'pregnancy, function', 'chapter': '001-139'}, {'description': 'pregnancy', 'chapter': '001-139'}, {'description': 'gestation', 'chapter': '001-139'}, {'description': 'pregnancy nos', 'chapter': '001-139'}, {'description': 'patient is currently pregnant.', 'chapter': '001-139'}, {'description': 'condition of having a developing embryo or fetus in the body.', 'chapter': '001-139'}, {'description': 'the status during which female mammals carry their developing young (embryos or fetuses) in utero before birth, beginning from fertilization to birth.', 'chapter': '001-139'}, {'description': 'patient currently pregnant', 'chapter': '001-139'}, {'description': 'the state or condition of having a developing embryo or fetus in the body (uterus), after union of an ovum and spermatozoon, during the period from conception to birth.', 'chapter': '001-139'}, {'description': 'parturiti

---
#### Create Fine Tuning Training Dataset

In [106]:
# Build the dataset 
# NOTE: This may take a while (15 min per 1500 samples)
# TODO: Make this more efficient

ft_df_list = []

for L1_node in tree.children:
    if L1_node.code[0] not in ['E', 'V']:
        # print(f"L1: {L1_node.code} - {L1_node.description}")
        generate_dataset(L1_node.description, L1_node.description, client, ft_df_list)
        for L2_node in L1_node.children:
            # print(f"L2: {L2_node.code} - {L2_node.description}")
            generate_dataset(L2_node.description, L1_node.description, client, ft_df_list)
            for L3_node in L2_node.children:
                # print(f"L3: {L3_node.code} - {L3_node.description}")
                # generate_dataset(L3_node.description, L1_node.code, client, ft_df_list)
                for L4_node in L3_node.children:
                    # print(f"L4: {L4_node.code} - {L4_node.description}")
                    # generate_dataset(L4_node.description, L1_node.code, client, ft_df_list)
                    for L5_node in L4_node.children:
                        # print(f"L5: {L5_node.code} - {L5_node.description}")
                        # generate_dataset(L5_node.description, L1_node.code, client, ft_df_list)
                        pass

In [169]:
# Examine data
ft_df = pd.DataFrame(ft_df_list)
ft_df.chapter = ft_df.chapter.apply(lambda x: x.strip())
print(ft_df.shape)
print(ft_df.dtypes)

(1618, 2)
description    object
chapter        object
dtype: object


In [159]:
# Add multi-label examples to the dataframe (normal distribution with a mean of 6 labels per example [min 1, max 12])

def multi_sample(code_count, sample_count):
    new_rows = []
    for i in range(sample_count):
        code_samples = list(map(str.strip, random.sample(chapter_codes, code_count)))
        item = {'description': '', 'chapter': ';'.join(code_samples)}
        desciption_list = []
        for chapter in code_samples:
            sample = ft_df[ft_df['chapter']==chapter].sample(1)
            desciption_list.append(sample['description'].values[0])

        item['description'] = ','.join(desciption_list)   
        new_rows.append(item)
    return new_rows


# TEST
new_rows = multi_sample(6, 3)
display(pd.DataFrame(new_rows))

Unnamed: 0,description,chapter
0,"disorder of peripheral nervous system nos,mort...",DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGAN...
1,"neuroendocrine neoplasm,NONSPECIFIC ABNORMAL F...","NEOPLASMS;SYMPTOMS, SIGNS, AND ILL-DEFINED CON..."
2,any abnormal condition of the body or mind tha...,DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CON...


In [165]:
# Add multi-label examples to the dataframe
# TODO: Add data according to the distribution of the MIMIC-III dataset

ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(2,150))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(3,200))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(4,300))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(5,500))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(6,1000))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(7,600))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(8,500))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(9,400))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(10,300))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(11,200))], ignore_index=True)
ft_df = pd.concat([ft_df, pd.DataFrame(multi_sample(12,150))], ignore_index=True)

(5918, 2)


Unnamed: 0,description,chapter
759,inflammatory rheumatism,DISEASES OF THE CIRCULATORY SYSTEM
4844,"pneumoconiosis,NONSPECIFIC ABNORMAL FINDINGS ,...","DISEASES OF THE RESPIRATORY SYSTEM;SYMPTOMS, S..."
4190,"patient is currently pregnant.,top term headin...","COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND TH..."
2613,"disorder of ear, unspecified,coagulation defec...",DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGAN...
561,disorder of hemostatic system,DISEASES OF THE BLOOD AND BLOOD-FORMING ORGANS
4645,"all deaths reported in a given population.,dis...",CERTAIN CONDITIONS ORIGINATING IN THE PERINATA...
2094,a genus of the family chlamydiaceae whose spec...,INFECTIOUS AND PARASITIC DISEASES;DISEASES OF ...
1833,"RHEUMATISM, EXCLUDING THE BACK ,disorder of ma...",DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CON...
2343,"diseases,OTHER CONDITIONS ORIGINATING IN THE P...",DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CON...
54,hiv infection,INFECTIOUS AND PARASITIC DISEASES


In [167]:
print(ft_df.shape)
display(ft_df.sample(10))

(5918, 2)


Unnamed: 0,description,chapter
3487,"a primary or metastatic malignant neoplasm involving the lip.,the status during which female mammals carry their developing young (embryos or fetuses) in utero before birth, beginning from fertilization to birth.,the proportion of patients with a particular disease during a given year per given unit of population.,aplastic anaemia,fracture of skull,INFECTIONS OF SKIN AND SUBCUTANEOUS TISSUE","NEOPLASMS;COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM;SYMPTOMS, SIGNS, AND ILL-DEFINED CONDITIONS;DISEASES OF THE BLOOD AND BLOOD-FORMING ORGANS;INJURY AND POISONING;DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE"
5493,"peripheral neuropathy,patient currently pregnant,bacterial infection of unspecified site,deficiency anaemia,neoplasm malignant,an abnormality of the nervous system that is present at birth or detected in the neonatal period.,fracture of lower leg,disorders of any of the organs that are associated with ingestion, digestion, and absorption of food.,disorder of the circulatory system,cartilage disorders","DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS;COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM;INFECTIOUS AND PARASITIC DISEASES;DISEASES OF THE BLOOD AND BLOOD-FORMING ORGANS;NEOPLASMS;CONGENITAL ANOMALIES;INJURY AND POISONING;DISEASES OF THE DIGESTIVE SYSTEM;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CONNECTIVE TISSUE"
5389,"congenital abnormality of urinary system,infection by treponema pallidum,osteopathia,disorder of pulmonary circulation,top term heading for all specific disorders and diseases; a disease is a deviation from or interruption of the normal structure or function of any part, organ or system (or combination thereof) of the body that is manifested by a characteristic set of symptoms and signs; a disorder is a derangement or abnormality of function.,nutritional deficiency, unspecified,NEUROTIC DISORDERS, PERSONALITY DISORDERS, AND OTHER NONPSYCHOTIC MENTAL DISORDERS ,NEOPLASMS ,disorders,disorders of any of the organs that are associated with ingestion, digestion, and absorption of food.","CONGENITAL ANOMALIES;INFECTIOUS AND PARASITIC DISEASES;DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CONNECTIVE TISSUE;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE GENITOURINARY SYSTEM;ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS;MENTAL DISORDERS;NEOPLASMS;DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE;DISEASES OF THE DIGESTIVE SYSTEM"
5349,"infectious and parasitic diseases,clinical disease and/or syndrome,hereditary hemolytic anemias,the proportion of deaths occurring in a population over a specified time.,OTHER MATERNAL AND FETAL COMPLICATIONS ,neurotic disorder,neurological disorder nos,the proportion of deaths occurring in a population over a specified time.,abdominal hernia,NUTRITIONAL DEFICIENCIES","INFECTIOUS AND PARASITIC DISEASES;DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE;DISEASES OF THE BLOOD AND BLOOD-FORMING ORGANS;SYMPTOMS, SIGNS, AND ILL-DEFINED CONDITIONS;COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM;MENTAL DISORDERS;DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS;CERTAIN CONDITIONS ORIGINATING IN THE PERINATAL PERIOD;DISEASES OF THE DIGESTIVE SYSTEM;ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS"
1294,"structural or functional abnormalities of the central or peripheral nervous system existing at birth and often before birth, resulting primarily from defects of embryogenesis.",CONGENITAL ANOMALIES
4373,"FRACTURE OF UPPER LIMB ,ihd,any of a variety of disorders marked by inflammation, degeneration, or metabolic derangement of the connective tissue structures of the body, especially the joints and related structures, including muscles, bursae, tendons, and fibrous tissue.,unspecified infectious and parasitic diseases,chronic obstructive airway disease,character neurosis nos,any abnormal condition of the body or mind that causes discomfort, dysfunction, or distress to the person affected or those in contact with the person. the term is often used broadly to include injuries, disabilities, syndromes, symptoms, deviant behaviors, and atypical variations of structure and function.,pathological processes involving the male reproductive tract (genitalia, male).","INJURY AND POISONING;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CONNECTIVE TISSUE;INFECTIOUS AND PARASITIC DISEASES;DISEASES OF THE RESPIRATORY SYSTEM;MENTAL DISORDERS;ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS;DISEASES OF THE GENITOURINARY SYSTEM"
2407,"chronic airway obstruction,INFECTIONS OF SKIN AND SUBCUTANEOUS TISSUE ,neurological disorder nos,ischaemia myocardial,a non-neoplastic or neoplastic disorder that affects the esophagus. representative examples of non-neoplastic disorders include esophagitis and esophageal ulcer. representative examples of neoplastic disorders include carcinomas, lymphomas, and melanomas.",DISEASES OF THE RESPIRATORY SYSTEM;DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE;DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE DIGESTIVE SYSTEM
5892,"cartilage disorder,helminthosis,disorder,patient is currently pregnant.,oesophageal disorder,disorders of the peripheral nervous system,neoplasm malignant,NEUROTIC DISORDERS, PERSONALITY DISORDERS, AND OTHER NONPSYCHOTIC MENTAL DISORDERS ,generalized arterial disease,nephrosis,intracranial injury,the proportion of deaths occurring in a population over a specified time.","DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CONNECTIVE TISSUE;INFECTIOUS AND PARASITIC DISEASES;ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS;COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM;DISEASES OF THE DIGESTIVE SYSTEM;DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS;NEOPLASMS;MENTAL DISORDERS;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE GENITOURINARY SYSTEM;INJURY AND POISONING;CERTAIN CONDITIONS ORIGINATING IN THE PERINATAL PERIOD"
2784,"NEOPLASMS OF UNSPECIFIED NATURE ,congenital anomaly nos of urinary system,nutritional deficiency, unspecified,disorder circulatory system,chondropathies,pregnancy nos","NEOPLASMS;CONGENITAL ANOMALIES;ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE MUSCULOSKELETAL SYSTEM AND CONNECTIVE TISSUE;COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM"
5249,"hiv infection,ear disorder nos,leg fracture,patient is currently pregnant.,anomaly congenital,disorder,the collection of organs and tissues, including the ovaries, genital tract, and breasts, that have several functions, including sexual maturation, pregnancy, and childbirth.,disorder cerebrovascular,DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE","INFECTIOUS AND PARASITIC DISEASES;DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS;INJURY AND POISONING;COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM;CONGENITAL ANOMALIES;ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS;DISEASES OF THE GENITOURINARY SYSTEM;DISEASES OF THE CIRCULATORY SYSTEM;DISEASES OF THE SKIN AND SUBCUTANEOUS TISSUE"


In [None]:
# define system prompt
# system prompt will be a constant in all examples
sys = 'Classify the following text into an ICD9 code chapter. The text is a clinical note from a patient medical record. ### You must choose from the following semi-colon delimited list of codes:{0} ### RESPOND ONLY WITH A CODE FROM THE LIST ABOVE.'.format('; '.join(chapter_codes))
print(dedent(sys))

In [110]:
# apply formatting to each row
ft_df["chapter"] = ft_df.chapter.apply(lambda x: {"role": "assistant", "content": x})
ft_df["description"] = ft_df.description.apply(lambda x: {"role": "user", "content": x})
ft_df['sys'] = sys
ft_df["sys"] = ft_df.sys.apply(lambda x: {"role": "system", "content": x})

out_df = pd.DataFrame()
out_df = ft_df.apply(lambda x: {"messages": x.values}, axis=1)

display(out_df.head(10))

0    {'messages': [{'role': 'user', 'content': 'INF...
1    {'messages': [{'role': 'user', 'content': 'inf...
2    {'messages': [{'role': 'user', 'content': 'uns...
3    {'messages': [{'role': 'user', 'content': 'inf...
4    {'messages': [{'role': 'user', 'content': 'INT...
5    {'messages': [{'role': 'user', 'content': 'int...
6    {'messages': [{'role': 'user', 'content': 'int...
7    {'messages': [{'role': 'user', 'content': 'int...
8    {'messages': [{'role': 'user', 'content': 'int...
9    {'messages': [{'role': 'user', 'content': 'inf...
dtype: object

In [111]:
# write to file
output_file_name = "data/ft/training_data_L1toL2_multi.jsonl"
out_df.to_json(output_file_name, orient="records", lines=True)

---
#### Examine the Dataset

In [112]:
df = pd.read_json(output_file_name, lines=True)
print(df.shape)

(1618, 1)


In [113]:
# Long tail problem no more

df['code'] = df['messages'].apply(lambda x: x[1]['content'])
print(df['code'].value_counts())

code
CONGENITAL ANOMALIES                                                      271
INFECTIOUS AND PARASITIC DISEASES                                         243
DISEASES OF THE BLOOD AND BLOOD-FORMING ORGANS                            177
NEOPLASMS                                                                 155
DISEASES OF THE CIRCULATORY SYSTEM                                        121
INJURY AND POISONING                                                       92
DISEASES OF THE GENITOURINARY SYSTEM                                       90
DISEASES OF THE RESPIRATORY SYSTEM                                         84
DISEASES OF THE DIGESTIVE SYSTEM                                           75
DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS                            58
MENTAL DISORDERS                                                           58
ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS      56
COMPLICATIONS OF PREGNANCY, CHILDBIRTH, AND THE PUERPERIUM 