This notebook is used for extracting dataset from pubmed files and converting them into required format. It also has code to extract from evaluation databases.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install biopython



We are using biopython python package, which supports Entrez API. The entrez API can query pubmed data.

In [None]:
from Bio import Entrez

def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='10000',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    return results

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

def get_abstract(paper):
    abstract = ''
    if 'Abstract' in paper['MedlineCitation']['Article']:
        abstract = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
        if isinstance(abstract, list):
            abstract = ' '.join(abstract)
    return abstract

In [None]:

def findDataForTerm(searchTerm):
  text = ""
  results = search(searchTerm)
  id_list = results['IdList']
  papers = fetch_details(id_list)
  print("found ")
  print(len(papers['PubmedArticle']))
  print("occurences...")
  for i, paper in enumerate(papers['PubmedArticle']):

          abstract = get_abstract(paper)
          text+=abstract

  return text

In [None]:
def split_list(lst, chunk_size=50):
    """Split a list into sublists of specified chunk size."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]


In [None]:
import spacy
import time

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 25000000

def create_dataset(text, words_to_search, entity_label):
    text = text.replace('\n', ' ')
    doc = nlp(text)
    dataset = []
    for sent in doc.sents:
        sentence_text = sent.text.strip()
        if sentence_text:

            for word in words_to_search:
                if word.lower() in sentence_text.lower():
                    start_pos = sentence_text.lower().find(word.lower())
                    end_pos = start_pos + len(word)
                    dataset.append((sentence_text, {'entities': [(start_pos, end_pos+1, entity_label)]}))
                    break

    return dataset

def read_text_from_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

file_path = "input_text.dat"
text = read_text_from_file("/content/abstract-38653920-set.txt")


words_to_search = ['Acanthosis nigricans', 'Acne keloidalis nuchae','Acne scars','Actinic keratosis','Alopecia areata', "Athlete's foot", 'Atopic dermatitis', 'Basal cell carcinoma', 'Bedbugs', 'Birthmarks', 'Boils and styes', 'Botulinum toxin', 'Bullous pemphigoid', 'Cellulitis', 'Central centrifugal cicatricial alopecia', 'CCCA', 'Chemical peels', 'Chickenpox', 'Cold sores', 'Contact dermatitis', 'Cradle cap', 'Cutaneous T-cell lymphoma', 'Dandruff', 'Dermatofibrosarcoma protuberan', 'DFSP', 'Diaper rash', 'Dry skin', 'Dyshidrotic eczema', 'Eczema', 'Epidermolysis bullosa', 'Female pattern hair loss', 'Folliculitis', 'Frontal fibrosing alopecia', 'Genital herpes', 'Genital warts', 'Granuloma annulare', 'Hair loss', 'Hand-foot-and-mouth disease', 'Head lice', 'Herpes simplex', 'Hidradenitis suppurativa', 'Hives', 'Hyperhidrosis', 'Ichthyosis vulgaris', 'Imiquimod', 'Impetigo', 'Isotretinoin', 'JAK Inhibitors', 'Keloid scars', 'Keratosis pilaris', 'Lasers', 'Leprosy', 'Lichen planus', 'Lupus', 'Lyme disease', 'Melanoma', 'Melasma', 'Merkel cell carcinoma', 'Moles', 'Molluscum contagiosum', 'Monkeypox rash', 'Nail fungus', 'Neurodermatitis', 'Nickel allergy', 'Nummular dermatitis', 'Pemphigus', 'Perioral dermatitis', 'Pityriasis rosea', 'Prurigo nodularis', 'Psoriasis', 'Psoriatic arthritis', 'Rashes', 'Ringworm', 'Rosacea', 'Sarcoidosis', 'Scabies', 'Scalp psoriasis', 'Scars', 'Scleroderma', 'Sebaceous carcinoma', 'Seborrheic dermatitis', 'Seborrheic keratoses', 'Shingles', 'Skin biopsy', 'Skin cancer', 'Skin tags', 'Squamous cell carcinoma', 'Stasis dermatitis', 'Stretch marks', 'Syphilis', 'Tinea versicolor', 'Vitiligo', 'Warts', 'Xeroderma pigmentosum']
dataset=[]
for word in words_to_search:
  texts_list = []
  print("searching for")
  print(word)
  print("................")
  text = findDataForTerm(word)
  print("found lines:")
  print(len(text))
  if(len(text)>2500000):
      texts_list = split_list(text,2500000)
  for chunck in texts_list:
    new_entities = create_dataset(text,[word],"Skin disease")
    print("found entities")
    print(len(new_entities))
    dataset+=new_entities
    time.sleep(10)

print(len(dataset))



In [None]:
dataset[-100:]

[('Our results demonstrated that sleep quality correlates with the postoperative effectiveness of ultrapulse fractional CO<sub>2</sub> laser in the treatment of atrophic acne scars.',
  {'entities': [(167, 178, 'Skin disease')]}),
 ('In search of a biocompatible implant for the correction of small deficiencies within the dermal corium as in wrinkles and acne scars, polymethylmethacrylate (PMMA) microspheres, 10 to 63 microns in diameter, were dispersed in Tween 80 medium and injected intradermally and subdermally into the abdominal skin of rats.',
  {'entities': [(122, 133, 'Skin disease')]}),
 ('Because PMMA products (Paladon, Palacos) have been used in medicine for almost 50 years without causing biological degradation or cancer, the material may be applied safely in the form of microspheres (Arteplast) in corium and subcutis of human patients with wrinkles or acne scars.',
  {'entities': [(271, 282, 'Skin disease')]}),
 ('However, the disorder might be underestimated probably becaus

In [None]:
len(dataset)

436039

In [111]:
file_path = "/output.txt"

dataset = []


with open(file_path, "r") as file:
    for line in file:
        line_data = eval(line.strip())
        dataset.append(line_data)


In [112]:
len(dataset)

436039

In [113]:
def remove_overlapping_entities(entities):
    entities.sort(key=lambda x: x[0])  # Sort entities by start index
    non_overlapping_entities = []

    for i, (start, end, label) in enumerate(entities):
        if i == 0:
            non_overlapping_entities.append((start, end, label))
        else:
            prev_start, prev_end, _ = non_overlapping_entities[-1]
            # Check for overlap
            if start >= prev_end:
                non_overlapping_entities.append((start, end, label))
            else:
                # If overlap, remove the shorter span
                if end - start > prev_end - prev_start:
                    non_overlapping_entities[-1] = (start, end, label)

    return non_overlapping_entities

In [114]:
def combine_entities_of_duplicates(dataset):

    combined_data = {}
    combined_sentences = []


    for sentence, entities in dataset:

        if sentence in combined_data:

            current_entity = entities['entities'][0]
            existing_entites = combined_data[sentence]

            if current_entity not in existing_entites:
              combined_data[sentence].append(entities['entities'][0])

        else:
            combined_data[sentence] = entities['entities']


    for sentence, entities in combined_data.items():
      current_entities ={
          "entities":remove_overlapping_entities(entities)
      }

      combined_sentences.append((sentence, current_entities))

    return combined_sentences

# Combine entities of duplicate sentences
op_data = combine_entities_of_duplicates(dataset)


In [115]:
print(len(op_data))

418338


In [118]:
with open('output-combined3.txt', 'w') as file:
    # Write each tuple to the file
    for item in op_data:
        # Convert the tuple to a string and write to file
        file.write("%s\n" % str(item))

In [119]:
file_path = "/content/output-combined3.txt"

dataset2 = []

# Read the content from the file and convert each line into a tuple
with open(file_path, "r") as file:
    for line in file:
        # Remove leading/trailing whitespace and evaluate the line as a tuple
        line_data = eval(line.strip())
        dataset2.append(line_data)

In [69]:
len(dataset2)

418338

We have downloaded and saved the extracted data to output-combined.txt

Now getting data for evaluation

In [1]:
!pip install datasets



In [2]:
from datasets import Dataset
import re
from datasets import load_dataset
import numpy as np
import datasets
import torch

In [3]:
bc5cdr = load_dataset("tner/bc5cdr")
jnlpba = load_dataset("jnlpba")
ncbi_dataset = load_dataset("ncbi_disease")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [1]:
content_list =  ['Acanthosis nigricans', 'Acne keloidalis nuchae','Acne scars','Actinic keratosis','Alopecia areata', "Athlete's foot", 'Atopic dermatitis', 'Basal cell carcinoma', 'Bedbugs', 'Birthmarks', 'Boils and styes', 'Botulinum toxin', 'Bullous pemphigoid', 'Cellulitis', 'Central centrifugal cicatricial alopecia', 'CCCA', 'Chemical peels', 'Chickenpox', 'Cold sores', 'Contact dermatitis', 'Cradle cap', 'Cutaneous T-cell lymphoma', 'Dandruff', 'Dermatofibrosarcoma protuberan', 'DFSP', 'Diaper rash', 'Dry skin', 'Dyshidrotic eczema', 'Eczema', 'Epidermolysis bullosa', 'Female pattern hair loss', 'Folliculitis', 'Frontal fibrosing alopecia', 'Genital herpes', 'Genital warts', 'Granuloma annulare', 'Hair loss', 'Hand-foot-and-mouth disease', 'Head lice', 'Herpes simplex', 'Hidradenitis suppurativa', 'Hives', 'Hyperhidrosis', 'Ichthyosis vulgaris', 'Imiquimod', 'Impetigo', 'Isotretinoin', 'JAK Inhibitors', 'Keloid scars', 'Keratosis pilaris', 'Lasers', 'Leprosy', 'Lichen planus', 'Lupus', 'Lyme disease', 'Melanoma', 'Melasma', 'Merkel cell carcinoma', 'Moles', 'Molluscum contagiosum', 'Monkeypox rash', 'Nail fungus', 'Neurodermatitis', 'Nickel allergy', 'Nummular dermatitis', 'Pemphigus', 'Perioral dermatitis', 'Pityriasis rosea', 'Prurigo nodularis', 'Psoriasis', 'Psoriatic arthritis', 'Rashes', 'Ringworm', 'Rosacea', 'Sarcoidosis', 'Scabies', 'Scalp psoriasis', 'Scars', 'Scleroderma', 'Sebaceous carcinoma', 'Seborrheic dermatitis', 'Seborrheic keratoses', 'Shingles', 'Skin biopsy', 'Skin cancer', 'Skin tags', 'Squamous cell carcinoma', 'Stasis dermatitis', 'Stretch marks', 'Syphilis', 'Tinea versicolor', 'Vitiligo', 'Warts', 'Xeroderma pigmentosum']
content_list = [line.lower().strip() for line in content_list]
skin_terms = set(content_list)

In [2]:
len(content_list)

94

In [5]:
def filter_samples(dataset, word_list):
    filtered_examples = []
    for item in dataset:
        for word in word_list:
            if word.lower() in item["tokens"]:  # Assuming "tokens" is the key containing the text in the dataset
                filtered_examples.append(item)
                break  # Break out of the inner loop if the word is found
    return filtered_examples

In [6]:
combined_samples = []
combined_samples.extend(bc5cdr["train"])
combined_samples.extend(bc5cdr["test"])
combined_samples.extend(bc5cdr["validation"])
combined_samples.extend(jnlpba["train"])
#
combined_samples.extend(jnlpba["validation"])
combined_samples.extend(ncbi_dataset["train"])
combined_samples.extend(ncbi_dataset["test"])
combined_samples.extend(ncbi_dataset["validation"])


In [7]:
filtered = filter_samples(combined_samples, content_list)
print(len(filtered))


122


In [8]:
import spacy

In [9]:
def convert_to_spacy_format(dataset,content_list):
    nlp = spacy.blank("en")
    spacy_data = []

    for sample in dataset:
        text = " ".join(sample['tokens'])
        entities = []
        char_index = 0
        start_char = None
        end_char = None

        for word in content_list:
          if word in text:
            start_pos = text.find(word)
            end_pos = start_pos + len(word)
            entities.append((start_pos, end_pos, "Skin disease"))
        if len(entities)>1: print(text, entities)
        spacy_data.append((text, {"entities": entities}))

    return spacy_data


In [10]:
spacy = convert_to_spacy_format(filtered,content_list)

BACKGROUND : Tacrolimus ointment is increasingly used for anti - inflammatory treatment of sensitive areas such as the face , and recent observations indicate that the treatment is effective in steroid - aggravated rosacea and perioral dermatitis . [(227, 246, 'Skin disease'), (215, 222, 'Skin disease')]
In 1 patient with atopic dermatitis , telangiectatic and papular rosacea insidiously appeared after 5 months of treatment . [(18, 35, 'Skin disease'), (65, 72, 'Skin disease')]
Skin rashes , proteinuria , systemic lupus erythematosus , polymyositis and myasthenia gravis have all been recorded as complications of penicillamine therapy in patients with rheumatoid arthritis . [(37, 42, 'Skin disease'), (5, 11, 'Skin disease')]


In [11]:
spacy[13:15]

[('8 x 10 ( 6 ) moles / glomerulus , p < 0 .',
  {'entities': [(13, 18, 'Skin disease')]}),
 ('Late - onset scleroderma renal crisis induced by tacrolimus and prednisolone : a case report .',
  {'entities': [(13, 24, 'Skin disease')]})]

In [12]:
len(spacy)

122

In [13]:
with open('eval.txt', 'w') as file:
    # Write each tuple to the file
    for item in spacy:
        # Convert the tuple to a string and write to file
        file.write("%s\n" % str(item))

In [14]:
file_path = "/content/eval.txt"

evaluation = []
print(eval)
# Read the content from the file and convert each line into a tuple
with open(file_path, "r") as file:
    for line in file:

        # Remove leading/trailing whitespace and evaluate the line as a tuple
        line_data = eval(line.strip())
        evaluation.append(line_data)

<built-in function eval>


In [16]:
print(len(evaluation))

122


In [19]:
evaluation[1:10]

[('Veno-occlusive liver disease after dacarbazine therapy ( DTIC ) for melanoma .',
  {'entities': [(68, 76, 'Skin disease')]}),
 ('A case of veno-occlusive disease of the liver with fatal outcome after dacarbazine ( DTIC ) therapy for melanoma is reported .',
  {'entities': [(104, 112, 'Skin disease')]}),
 ('These 13 included cases of malignant hypertension , thrombotic microangiopathy , lupus nephritis , Henoch-Schonlein nephritis , crescentic glomerulonephritis , and cocaine - related acute renal failure .',
  {'entities': [(81, 86, 'Skin disease')]}),
 ('Treatment of psoriasis with azathioprine .',
  {'entities': [(13, 22, 'Skin disease')]}),
 ('Azathioprine treatment benefited 19 ( 66 %) out of 29 patients suffering from severe psoriasis .',
  {'entities': [(85, 94, 'Skin disease')]}),
 ('Is the treatment of scabies hazardous ? Treatment for scabies is usually initiated by general practitioners ; most consider lindane ( gamma benzene hexachloride ) the treatment of choice .',
  {'

On succesfull execution of the notebook it should generate two files output-combined3.txt (which is our main training data) and eval.txt(for evaluation)