### Web Scraping

In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    try:
        # Send a GET request to the URL
        response = requests.get(url, timeout=10)  # Time out instead of hanging indefinitely
        response.raise_for_status()  # Raise an HTTPError for bad responses

        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract paragraph and list-item elements
        paragraphs = soup.find_all(['p', 'li'])

        # Join the extracted text, one element per line
        extracted_text = '\n'.join(element.get_text() for element in paragraphs)

        return extracted_text

    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

diabetes_url = "https://my.clevelandclinic.org/health/diseases/7104-diabetes"

# Scrape the webpage
diabetes_dataset = scrape_webpage(diabetes_url)

# Save the scraped data to a text file (skipped if the request failed and returned None)
if diabetes_dataset is not None:
    with open("diabetes_dataset.txt", "w") as file:
        file.write(diabetes_dataset)
    print("Dataset saved successfully as diabetes_dataset.txt")

Dataset saved successfully as diabetes_dataset.txt


### Data Collection


#### Bucket 1: Diabetes

In [None]:
# Load the dataset again after it has been modified
def load_dataset(file_path):
    try:
        with open(file_path, "r") as file:
            dataset = file.read().split("\n\n")  # Split paragraphs based on double newline characters
        return dataset
    except FileNotFoundError:
        print(f"Error: {file_path} not found.")
        return []  # Empty list so callers can still iterate

file_path = "diabetes.txt"
diabetes_dataset = load_dataset(file_path)

for i, paragraph in enumerate(diabetes_dataset):
    print(f"Paragraph {i+1}:")
    print(paragraph)
    print("\n")

Paragraph 1:
Diabetes is a common condition that affects people of all ages. There are several forms of diabetes. Type 2 is the most common. A combination of treatment strategies can help you manage the condition to live a healthy life and prevent complications.
Diabetes is a condition that happens when your blood sugar (glucose) is too high. It develops when your pancreas doesn’t make enough insulin or any at all, or when your body isn’t responding to the effects of insulin properly. Diabetes affects people of all ages. Most forms of diabetes are chronic (lifelong), and all forms are manageable with medications and/or lifestyle changes.
Glucose (sugar) mainly comes from carbohydrates in your food and drinks. It’s your body’s go-to source of energy. Your blood carries glucose to all your body’s cells to use for energy.
When glucose is in your bloodstream, it needs help — a “key” — to reach its final destination. This key is insulin (a hormone). If your pancreas isn’t making enough insu
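
`load_dataset` splits only on blank lines, so lines separated by a single newline stay inside the same paragraph. A minimal sketch of that behavior, using a made-up sample string rather than the real dataset:

```python
# Hypothetical sample: single newlines inside a block, a blank line between blocks
sample = "Line one.\nLine two.\n\nLine three.\n\n"

# Same split as in load_dataset
paragraphs = sample.split("\n\n")
print(paragraphs)

# A trailing blank line leaves an empty string behind; filtering removes it
non_empty = [p for p in paragraphs if p.strip()]
print(len(non_empty))
```

The first two lines end up in one paragraph, and the trailing blank line produces an empty final entry, which may be worth filtering out before further processing.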

#### Bucket 2: Cardiovascular Health

In [None]:
heart_url = "https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)"

# Scrape the webpage
heart_dataset = scrape_webpage(heart_url)

# Save the scraped data to a text file (skipped if the request failed and returned None)
if heart_dataset is not None:
    with open("heart_dataset.txt", "w") as file:
        file.write(heart_dataset)
    print("Dataset saved successfully as heart_dataset.txt")

Dataset saved successfully as heart_dataset.txt


In [None]:
# Load the file again after manual processing
file_path = "heart.txt"
heart_dataset = load_dataset(file_path)

for i, paragraph in enumerate(heart_dataset):
    print(f"Paragraph {i+1}:")
    print(paragraph)
    print("\n")

#### Bucket 3: Hypertension/Blood Pressure

In [None]:
bp_url = "https://www.cdc.gov/bloodpressure/about.htm#:~:text=Blood%20pressure%20is%20measured%20using,your%20bp%20rests%20between%20beats."

# Scrape the webpage
bp_dataset = scrape_webpage(bp_url)

# Save the scraped data to a text file (skipped if the request failed and returned None)
if bp_dataset is not None:
    with open("bp_dataset.txt", "w") as file:
        file.write(bp_dataset)
    print("Dataset saved successfully as bp_dataset.txt")

Dataset saved successfully as bp_dataset.txt


In [None]:
# Load the dataset again after manual processing
file_path = "bp.txt"
bp_dataset = load_dataset(file_path)

for i, paragraph in enumerate(bp_dataset):
    print(f"Paragraph {i+1}:")
    print(paragraph)
    print("\n")

Paragraph 1:
Blood pressure is the pressure of blood pushing against the walls of your arteries. Arteries carry blood from your heart to other parts of your body. Your blood pressure normally rises and falls throughout the day. Blood pressure is measured using two numbers: The first number, called systolic blood pressure, measures the pressure in your arteries when your heart beats. The second number, called diastolic blood pressure, measures the pressure in your arteries when your heart rests between beats. If the measurement reads 120 systolic and 80 diastolic, you would say, “120 over 80,” or write, “120/80 mmHg.” A normal blood pressure level is less than 120/80 mmHg.1 


Paragraph 2:
No matter your age, you can take steps each day to keep your blood pressure in a healthy range. High blood pressure, also called hypertension, is blood pressure that is higher than normal. Your blood pressure changes throughout the day based on your activities. Having blood pressure measures consisten
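
The three buckets follow the same load-and-split pattern, so they could be gathered into one dictionary. The following is only a sketch under assumptions: `load_paragraphs`, `load_buckets`, and the temporary demo file are hypothetical names introduced here, standing in for the notebook's `load_dataset` and its cleaned text files.

```python
import os
import tempfile

def load_paragraphs(file_path):
    """Return the paragraphs of a text file, or an empty list if the file is missing."""
    try:
        with open(file_path, "r") as file:
            return file.read().split("\n\n")
    except FileNotFoundError:
        return []

def load_buckets(paths):
    """Map each bucket name to its list of paragraphs."""
    return {name: load_paragraphs(path) for name, path in paths.items()}

# Demonstration with a temporary file standing in for a cleaned dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w") as file:
        file.write("First paragraph.\n\nSecond paragraph.")
    missing_path = os.path.join(tmp_dir, "none.txt")
    buckets = load_buckets({"demo": demo_path, "missing": missing_path})

print(buckets)
```

With the notebook's files, the mapping might look like `{"diabetes": "diabetes.txt", "heart": "heart.txt", "bp": "bp.txt"}`, which keeps each bucket under its own key instead of reusing one variable.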

### Preprocessing

In [None]:
import string

def preprocess_text(text):
    # Tokenization: split on whitespace
    tokens = text.split()

    # Lowercasing
    tokens = [token.lower() for token in tokens]

    # Strip surrounding ASCII punctuation so that "ages." is kept as "ages"
    tokens = [token.strip(string.punctuation) for token in tokens]

    # Keep purely alphabetic tokens; this drops numbers and tokens with internal symbols
    tokens = [token for token in tokens if token.isalpha()]

    # Removing Stopwords
    stopwords = set(["a", "an", "the", "is", "and", "it", "of", "in", "to", "for", "on", "with", "your", "you", "are", "that", "but"])
    tokens = [token for token in tokens if token not in stopwords]

    return tokens

# Load the dataset
file_path = "diabetes_dataset.txt"
with open(file_path, "r") as file:
    diabetes_dataset = file.read().split("\n\n")  # Split paragraphs based on double newline characters

# Preprocess each paragraph in the dataset
preprocessed_dataset = [preprocess_text(paragraph) for paragraph in diabetes_dataset]

# Example usage: Print the preprocessed tokens of the first paragraph
print(preprocessed_dataset[0])
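
The alphabetic filter is sensitive to punctuation attached to a word, which a small standalone sketch can illustrate. The `clean_tokens` helper and the `strip_punctuation` flag are hypothetical and exist only for this comparison; the sample sentence is taken from the diabetes text shown earlier.

```python
import string

def clean_tokens(text, strip_punctuation=True):
    """Hypothetical helper: lowercase, optionally strip punctuation, keep alphabetic tokens."""
    tokens = [token.lower() for token in text.split()]
    if strip_punctuation:
        # Remove leading and trailing ASCII punctuation such as "." or ","
        tokens = [token.strip(string.punctuation) for token in tokens]
    return [token for token in tokens if token.isalpha()]

sample = "Type 2 is the most common."

without_strip = clean_tokens(sample, strip_punctuation=False)
with_strip = clean_tokens(sample, strip_punctuation=True)

print(without_strip)  # ['type', 'is', 'the', 'most']
print(with_strip)     # ['type', 'is', 'the', 'most', 'common']
```

Without stripping, `"common."` fails `isalpha()` and is discarded along with the number, so sentence-final words are lost. Note that `string.punctuation` covers ASCII characters only, so tokens containing typographic quotes or apostrophes would still be dropped.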