# Data Collection from MedlinePlus Genetic Conditions

Our data comes from MedlinePlus, an online health information resource run by the NIH. For each genetic condition listed on the Genetic Conditions page, we scraped its description, frequency, causes, and inheritance information.

In [None]:
import requests
from bs4 import BeautifulSoup

# get genetic conditions
url = 'https://medlineplus.gov/genetics/condition/'

# Send a GET request to the URL and store the response
response = requests.get(url)
print(response.content)

# Parse the HTML response using Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

b'\n<!DOCTYPE html>\n<html lang="en" id="general" class="nojs us" data-root="https://medlineplus.gov/">\n\n    <head>\n\n        <meta charset="utf-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n        <meta http-equiv="window-target" content="_top" />\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n        <meta name="viewport" content="width=device-width, initial-scale=1" />\n\n        \n\n        \n\n        <link rel="canonical" href="https://medlineplus.gov/genetics/condition/" />\n\n        \n\n  \n        <link href="https://medlineplus.gov/genetics/condition/" hreflang="x-default" rel="alternate">\n  \n\n\n\n        <meta name="ac-dictionary" content="medlineplus-ac-dictionary" />\n\n        \n\n        \n\n        <link rel="shortcut icon" href="https://medlineplus.gov/images/favicon.ico" type="image/x-icon" />\n        <link rel="apple-touch-icon" href="https://medlineplus.gov/images/touch-icon.png" />\n\n    
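
The request above has no timeout or status check, so a slow or failed response would go unnoticed. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` (the `get`, `retries`, `pause_seconds`, and `timeout` parameters are illustrative and not part of the original notebook):

```python
import time

import requests


def fetch_html(url, get=requests.get, retries=3, pause_seconds=1.0, timeout=10):
    """Return the page body as text, retrying on network or HTTP errors."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return response.text
        except requests.RequestException as error:
            last_error = error
            # wait a little longer after each failed attempt
            time.sleep(pause_seconds * (attempt + 1))
    raise last_error


# Example call (not executed here, since it needs network access):
# html = fetch_html('https://medlineplus.gov/genetics/condition/')
```

Passing the `get` callable as a parameter is only a convenience for trying the helper without network access; calling `requests.get` directly inside the function would work the same way.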

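Isolating the hrefs that lead to a disease page. Note that the `/condition/` filter also matches the Customer Support link and returns some URLs more than once, as the output shows.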

In [None]:
#check disease urls
urls = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href and '/condition/' in href:
        urls.append(href)
print(urls)

['https://support.nlm.nih.gov/knowledgebase/category/?id=CAT-01231&category=medlineplus&from=https%3A//medlineplus.gov/genetics/condition/', 'https://medlineplus.gov/genetics/condition/tangier-disease/', 'https://medlineplus.gov/genetics/condition/ataxia-telangiectasia/', 'https://medlineplus.gov/genetics/condition/alopecia-areata/', 'https://medlineplus.gov/genetics/condition/triple-a-syndrome/', 'https://medlineplus.gov/genetics/condition/triple-a-syndrome/', 'https://medlineplus.gov/genetics/condition/aromatic-l-amino-acid-decarboxylase-deficiency/', 'https://medlineplus.gov/genetics/condition/aarskog-scott-syndrome/', 'https://medlineplus.gov/genetics/condition/aarskog-scott-syndrome/', 'https://medlineplus.gov/genetics/condition/aarskog-scott-syndrome/', 'https://medlineplus.gov/genetics/condition/pyridoxine-dependent-epilepsy/', 'https://medlineplus.gov/genetics/condition/diamond-blackfan-anemia/', 'https://medlineplus.gov/genetics/condition/diamond-blackfan-anemia/', 'https://me
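
The list above includes a Customer Support link (its query string happens to contain `/condition/`) and repeated URLs. A possible stricter filter is sketched here using only the standard library. The helper names `is_condition_page` and `unique_condition_urls` are hypothetical, and the sketch assumes the hrefs are absolute URLs on `medlineplus.gov`, as they appear in the output:

```python
from urllib.parse import urlparse

CONDITION_PREFIX = "/genetics/condition/"


def is_condition_page(href):
    """True when href points at an individual condition page on medlineplus.gov."""
    parsed = urlparse(href)
    if parsed.netloc != "medlineplus.gov":
        return False
    if not parsed.path.startswith(CONDITION_PREFIX):
        return False
    # require a condition slug after the prefix, which excludes the index page itself
    slug = parsed.path[len(CONDITION_PREFIX):].strip("/")
    return slug != ""


def unique_condition_urls(hrefs):
    """Keep condition-page URLs in first-seen order, dropping repeats."""
    seen = set()
    unique = []
    for href in hrefs:
        if href and is_condition_page(href) and href not in seen:
            seen.add(href)
            unique.append(href)
    return unique


sample_hrefs = [
    "https://support.nlm.nih.gov/knowledgebase/category/?id=CAT-01231&category=medlineplus&from=https%3A//medlineplus.gov/genetics/condition/",
    "https://medlineplus.gov/genetics/condition/tangier-disease/",
    "https://medlineplus.gov/genetics/condition/triple-a-syndrome/",
    "https://medlineplus.gov/genetics/condition/triple-a-syndrome/",
]
condition_urls = unique_condition_urls(sample_hrefs)
print(condition_urls)
```

Deduplicating by URL keeps only the first link to each page, so any alternative names that link to the same condition would be dropped. Whether that is desirable depends on how the names are used later.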

Printing the text of every link to check that the `a` elements correctly isolate the disease titles

In [None]:
#Find the element/attributes to return the title of the disease

url = 'https://medlineplus.gov/genetics/condition/'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
a_elements = soup.find_all('a')

for a_element in a_elements:
    print(a_element.text)


Scraping all of the genetic conditions on the page

In [None]:
import pandas as pd
df = pd.DataFrame()


In [None]:
import time

diseases = {}
for link in soup.find_all('a'):
    disease = link.text
    disease_url = link.get('href')
    if disease_url and '/condition/' in disease_url:  # the link leads to a disease page
        disease_response = requests.get(disease_url)
        disease_soup = BeautifulSoup(disease_response.text, 'html.parser')

        # locate the four sections of interest on the disease page
        description_section = disease_soup.find('div', class_='mp-content')
        causes_section = disease_soup.find('div', class_="mp-exp exp-full", attrs={"data-bookmark": "causes"})
        frequency_section = disease_soup.find('div', class_="mp-exp exp-full", attrs={"data-bookmark": "frequency"})
        inheritance_section = disease_soup.find('div', class_="mp-exp exp-full", attrs={"data-bookmark": "inheritance"})

        # fall back to an empty string when a section is missing;
        # each check must test its own section, not causes_section
        description = description_section.get_text(strip=True) if description_section is not None else ""
        causes = causes_section.get_text(strip=True) if causes_section is not None else ""
        frequency = frequency_section.get_text(strip=True) if frequency_section is not None else ""
        inheritance = inheritance_section.get_text(strip=True) if inheritance_section is not None else ""

        diseases[disease] = {
            "description": description,
            "causes": causes,
            "frequency": frequency,
            "inheritance": inheritance
        }

        # pause briefly between requests
        time.sleep(0.05)
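
The four section lookups in the loop above follow the same pattern, and `get_text(strip=True)` joins adjacent text nodes without a space, which is why the scraped text contains fused words such as `theABCA1gene`. A minimal sketch of a shared helper, assuming a hypothetical function named `section_text` and a small hand-written HTML sample (not the real page markup):

```python
from bs4 import BeautifulSoup


def section_text(page_soup, bookmark):
    """Return the text of the section with the given data-bookmark, or "" if absent."""
    section = page_soup.find(
        'div', class_="mp-exp exp-full", attrs={"data-bookmark": bookmark}
    )
    if section is None:
        return ""
    # a single-space separator keeps words from neighbouring tags apart
    return section.get_text(" ", strip=True)


# hand-written sample for illustration only
sample_html = """
<div class="mp-exp exp-full" data-bookmark="causes">
  <h2>Causes</h2>
  <p>Mutations in the <i>ABCA1</i> gene cause Tangier disease.</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

causes = section_text(sample_soup, "causes")
frequency = section_text(sample_soup, "frequency")  # section not present in the sample
print(repr(causes))
print(repr(frequency))
```

With a separator, the heading text is followed by a space (`Causes Mutations ...`), so any later cleanup that strips the heading label would also need to strip that space.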



Putting each disease and its description, causes, frequency, and inheritance pattern into a DataFrame

In [None]:
df = pd.DataFrame.from_dict(diseases, orient='index', columns=["description", "causes", "frequency", "inheritance"])

In [None]:
print(diseases)

{'Customer Support': {'description': '', 'causes': '', 'frequency': '', 'inheritance': ''}, 'Tangier disease': {'description': 'Tangier disease is an inherited disorder characterized by significantly reduced levels of high-density lipoprotein (HDL) in the blood.HDL transports cholesteroland certain fats called phospholipids from the body\'s tissues to the liver, where they are removed from the blood. HDL is often referred to as "good cholesterol" because high levels of this substance reduce the chances of developing heart and blood vessel (cardiovascular) disease. Because people with Tangier disease have very low levels of HDL, they have a moderately increased risk of cardiovascular disease.Additional signs and symptoms of Tangier disease include a slightly elevated amount of fat in the blood (mild hypertriglyceridemia); disturbances in nerve function (neuropathy); and enlarged, orange-colored tonsils. Affected individuals often developatherosclerosis, which is an accumulation of fatty

In [None]:
df

Unnamed: 0,description,causes,frequency,inheritance
Customer Support,,,,
Tangier disease,Tangier disease is an inherited disorder chara...,CausesMutations in theABCA1gene cause Tangier ...,FrequencyTangier disease is a rare disorder wi...,InheritanceThis condition is inherited in anau...
Ataxia-telangiectasia,Ataxia-telangiectasia is a rare inherited diso...,CausesVariants (also called mutations) in theA...,FrequencyAtaxia-telangiectasia occurs in 1 in ...,InheritanceAtaxia-telangiectasia is inherited ...
Alopecia areata,Alopecia areata is a common disorder that caus...,CausesThe causes of alopecia areata are comple...,FrequencyAlopecia areata affects 1 in every 50...,InheritanceThe inheritance pattern of alopecia...
Triple A syndrome,Triple A syndrome is an inherited condition ch...,CausesMutations in theAAASgene cause triple A ...,FrequencyTriple A syndrome is a rare condition...,InheritanceThis condition is inherited in anau...
...,...,...,...,...
Spastic paraplegia type 11,Spastic paraplegia type 11 is part of a group ...,CausesMutations in theSPG11gene cause spastic ...,FrequencyOver 100 cases of spastic paraplegia ...,InheritanceThis condition is inherited in anau...
Spastic paraplegia type 49,Spastic paraplegia type 49 is part of a group ...,CausesSpastic paraplegia type 49 is caused by ...,FrequencySpastic paraplegia type 49 is a rare ...,InheritanceThis condition is inherited in anau...
JAK3-deficient severe combined immunodeficiency,JAK3-deficient severe combined immunodeficienc...,CausesJAK3-deficient SCID is caused by mutatio...,FrequencyJAK3-deficient SCID accounts for an e...,InheritanceThis condition is inherited in anau...
Pulmonary arterial hypertension,Pulmonary arterial hypertension is a progressi...,CausesMutations in theBMPR2gene are the most c...,"FrequencyIn the United States, about 1,000 new...",InheritancePulmonary arterial hypertension is ...


Cleaning the DataFrame by deleting the irrelevant Customer Support row, stripping the leading "Causes"/"Frequency"/"Inheritance" label from each value in the corresponding column, and checking for null values

In [None]:
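#delete the Customer Support row, which is not a disease
df = df.drop(index="Customer Support")
#strip the leading "Causes", "Frequency", "Inheritance" label from its own column only;
#the ^ anchor leaves later occurrences of these words in the text untouched
df["causes"] = df["causes"].str.replace(r"^Causes", "", regex=True)
df["frequency"] = df["frequency"].str.replace(r"^Frequency", "", regex=True)
df["inheritance"] = df["inheritance"].str.replace(r"^Inheritance", "", regex=True)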

df


Unnamed: 0,description,causes,frequency,inheritance
Tangier disease,Tangier disease is an inherited disorder chara...,Mutations in theABCA1gene cause Tangier diseas...,Tangier disease is a rare disorder with approx...,This condition is inherited in anautosomal rec...
Ataxia-telangiectasia,Ataxia-telangiectasia is a rare inherited diso...,Variants (also called mutations) in theATMgene...,"Ataxia-telangiectasia occurs in 1 in 40,000 to...",Ataxia-telangiectasia is inherited in anautoso...
Alopecia areata,Alopecia areata is a common disorder that caus...,The causes of alopecia areata are complex and ...,"Alopecia areata affects 1 in every 500 to 1,00...",The inheritance pattern of alopecia areata is ...
Triple A syndrome,Triple A syndrome is an inherited condition ch...,Mutations in theAAASgene cause triple A syndro...,"Triple A syndrome is a rare condition, althoug...",This condition is inherited in anautosomal rec...
Aromatic l-amino acid decarboxylase deficiency,Aromatic l-amino acid decarboxylase (AADC) def...,Mutations in theDDCgene cause AADC deficiency....,AADC deficiency is a rare disorder. Only about...,This condition is inherited in anautosomal rec...
...,...,...,...,...
Spastic paraplegia type 11,Spastic paraplegia type 11 is part of a group ...,Mutations in theSPG11gene cause spastic parapl...,Over 100 cases of spastic paraplegia type 11 h...,This condition is inherited in anautosomal rec...
Spastic paraplegia type 49,Spastic paraplegia type 49 is part of a group ...,Spastic paraplegia type 49 is caused by mutati...,Spastic paraplegia type 49 is a rare disorder....,This condition is inherited in anautosomal rec...
JAK3-deficient severe combined immunodeficiency,JAK3-deficient severe combined immunodeficienc...,JAK3-deficient SCID is caused by mutations in ...,JAK3-deficient SCID accounts for an estimated ...,This condition is inherited in anautosomal rec...
Pulmonary arterial hypertension,Pulmonary arterial hypertension is a progressi...,Mutations in theBMPR2gene are the most common ...,"In the United States, about 1,000 new cases of...",Pulmonary arterial hypertension is usually spo...
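
An unanchored `df.replace("Causes", "", regex=True)` removes the word wherever it appears, in every column, not only the leading label. The toy comparison below illustrates the difference. The DataFrame contents are made up for the example:

```python
import pandas as pd

# made-up rows: the section label is followed by text that reuses the same word
toy = pd.DataFrame(
    {
        "causes": ["CausesCauses of this condition are unknown."],
        "inheritance": ["InheritanceInheritance is autosomal recessive."],
    },
    index=["Example condition"],
)

# unanchored: every occurrence of "Causes" is removed
unanchored = toy.replace("Causes", "", regex=True)

# anchored and per column: only the leading label is removed
anchored = toy.copy()
for column, label in {"causes": "Causes", "inheritance": "Inheritance"}.items():
    anchored[column] = anchored[column].str.replace(f"^{label}", "", regex=True)

print(unanchored.loc["Example condition", "causes"])
print(anchored.loc["Example condition", "causes"])
```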


In [None]:
#check for null; missing sections were stored as "" during scraping, so isna() will not flag them
df[df.isna().any(axis=1)]

Unnamed: 0,description,causes,frequency,inheritance
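
Because the scraping loop stores `""` for a missing section, the `isna()` check comes back empty even if some sections were not found. A small sketch of a check that also catches blank strings, using made-up placeholder rows:

```python
import pandas as pd

# placeholder rows: the second one has a blank "causes" value
toy = pd.DataFrame(
    {
        "causes": ["placeholder causes text", ""],
        "frequency": ["placeholder frequency text", "placeholder frequency text"],
    },
    index=["Example condition A", "Example condition B"],
)

nan_rows = toy[toy.isna().any(axis=1)]      # finds nothing: "" is not NaN
blank_rows = toy[toy.eq("").any(axis=1)]    # finds the row with the blank value

print(len(nan_rows))
print(list(blank_rows.index))
```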


In [None]:
df.to_csv("diseases.csv", index = True)
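
Since the index holds the disease names but has no name of its own, the CSV header starts with an empty field, and pandas labels that column `Unnamed: 0` when the file is read back without `index_col`. A minimal round-trip sketch, assuming a hypothetical index label `disease` and an in-memory buffer in place of `diseases.csv`:

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"description": ["placeholder text"]},
    index=["Example condition"],
)

# unnamed index: the first header field is empty
plain = io.StringIO()
toy.to_csv(plain, index=True)
plain.seek(0)
reloaded_plain = pd.read_csv(plain)

# named index: the header is explicit and can be restored as the index on load
labeled = io.StringIO()
toy.to_csv(labeled, index=True, index_label="disease")
labeled.seek(0)
reloaded_labeled = pd.read_csv(labeled, index_col="disease")

print(list(reloaded_plain.columns))
print(reloaded_labeled.index.name)
```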