<a href="https://colab.research.google.com/github/zypchn/med-data-tr/blob/main/hastalarsoruyor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install requests -q
! pip install html5lib -q
! pip install bs4 -q
! pip install tiktoken -q


In [2]:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import pandas as pd
import tiktoken
from google.colab import files
import json

Some notes on the website structure:
- The site does not display all questions at once; further questions are loaded through a "show more" button.
- Each batch comes from a paginated endpoint that returns server-rendered HTML, so no JavaScript execution is needed. <br/>Therefore, the `requests` and `BeautifulSoup` libraries are enough to scrape the data.
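The "show more" behaviour amounts to requesting successive pages until one comes back empty. Below is a minimal sketch of that stopping rule, with the network call replaced by a hypothetical `fetch_page` callable so the loop can be checked offline. The names `collect_until_empty`, `fetch_page`, `max_pages` and the fake paths are assumptions for illustration, not part of the site or of this notebook.

```python
from typing import Callable, List


def collect_until_empty(fetch_page: Callable[[int], List[str]],
                        max_pages: int = 1000) -> List[str]:
    """Collect items page by page, stopping at the first empty page.

    `fetch_page` is a hypothetical stand-in for the real HTTP request;
    `max_pages` is an assumed safety cap so the loop cannot run forever.
    """
    items: List[str] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page marks the end of the data
            break
        items.extend(batch)
    return items


# Fake three-page data source (made-up paths) used only to exercise the loop.
fake_pages = {1: ["/q/a", "/q/b"], 2: ["/q/c"], 3: []}
demo_urls = collect_until_empty(lambda n: fake_pages.get(n, []))
print(demo_urls)
```

The cap is a design choice rather than a requirement: it only guards against an endpoint that never returns an empty page.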

In [13]:
base_url = "https://hastalarsoruyor.com"
q_url = base_url + "/sorular"

# Getting the URLs

In [14]:
def get_all_urls():
    """Collect question URLs from the paginated listing until a page is empty."""
    urls = []
    num_page = 1
    api_url = "https://hastalarsoruyor.com/soru-lar/liste"

    while True:
        params = {
            "sayfa": num_page,      # page number
            "sirala": "cevapli"     # sort order: answered questions
        }
        response = requests.get(api_url, params=params, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")

        divs = soup.find_all("div", {"class": "question-body"})
        if not divs:
            # An empty page means there is no more data to fetch.
            print(f"Data ends at page: {num_page}")
            break

        for div in divs:
            a_tag = div.find("a")
            if a_tag and "href" in a_tag.attrs:
                urls.append(a_tag["href"])
        print(f"URL extraction is successful for page: {num_page}")
        num_page += 1

    return urls

In [15]:
all_urls = get_all_urls()   # takes about 5 minutes (roughly 1 second per page)

URL extraction is successful for page: 1
URL extraction is successful for page: 2
URL extraction is successful for page: 3
URL extraction is successful for page: 4
URL extraction is successful for page: 5
URL extraction is successful for page: 6
URL extraction is successful for page: 7
URL extraction is successful for page: 8
URL extraction is successful for page: 9
URL extraction is successful for page: 10
URL extraction is successful for page: 11
URL extraction is successful for page: 12
URL extraction is successful for page: 13
URL extraction is successful for page: 14
URL extraction is successful for page: 15
URL extraction is successful for page: 16
URL extraction is successful for page: 17
URL extraction is successful for page: 18
URL extraction is successful for page: 19
URL extraction is successful for page: 20
URL extraction is successful for page: 21
URL extraction is successful for page: 22
URL extraction is successful for page: 23
URL extraction is successful for page: 24
...

In [16]:
len(all_urls)

1947

# Getting Text Content

- Some questions have more than one answer. To keep the dataset structured, only the first answer is parsed.

- The *a* element that holds the medical field has the class attribute "text-primary".
- The *h1* element that holds the question title has the class attribute "question-title".
- The *p* element that holds the question body has the class attribute "question-desc".
- The *div* element that holds the answer body has the class attribute "py-10" (a CSS utility class for vertical padding).
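The lookups above can be tried offline on a small HTML fragment. This is a sketch only: `SAMPLE_HTML` is made up to mimic the listed class names, `parse_question` is a hypothetical helper, and the real page layout may differ. The answer is read from a `div`, as in the notebook's `get_text` function, and `find` returns the first match, which is why only the first answer is kept.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names listed above (an assumption;
# the real pages may be structured differently).
SAMPLE_HTML = """
<a class="text-primary" href="/">Home</a>
<a class="text-primary" href="/field">#Vertigo</a>
<h1 class="question-title">Sample title</h1>
<p class="question-desc">  Sample question body.  </p>
<div class="py-10">First answer.</div>
<div class="py-10">Second answer.</div>
"""


def parse_question(html: str) -> dict:
    """Extract field, title, question and first answer from one page."""
    soup = BeautifulSoup(html, "html.parser")

    # The second "text-primary" link carries the field after a "#".
    links = soup.find_all("a", {"class": "text-primary"})
    field = links[1].text.split("#")[1]

    title = soup.find("h1", {"class": "question-title"}).text
    question = soup.find("p", {"class": "question-desc"}).text.strip()

    # find() returns only the first match, so later answers are ignored.
    answer = soup.find("div", {"class": "py-10"}).text.strip()

    return {"field": field, "title": title,
            "question": question, "answer": answer}


demo_record = parse_question(SAMPLE_HTML)
print(demo_record)
```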


In [47]:
def num_tokens_from_string(string: str, encoder_name: str) -> int:
    # encoding = tiktoken.encoding_for_model(model_name)
    encoding = tiktoken.get_encoding(encoder_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [51]:
records = []

def get_text(url, encoder_name):
  """Fetch one question page and append its parsed record to `records`."""
  res = requests.get(url, timeout=30)
  res.raise_for_status()
  soup = BeautifulSoup(res.content, "html.parser")

  # The second "text-primary" link holds the field; its text is split on "#".
  a_tag = soup.find_all("a", {"class": "text-primary"})
  question_field = a_tag[1].text.split("#")[1]
  question_header = soup.find("h1", {"class": "question-title"}).text
  question_text = soup.find("p", {"class": "question-desc"}).text.strip()
  # find() returns the first match, so only the first answer is kept.
  question_answer = soup.find("div", {"class": "py-10"}).text.strip()

  # Token count of the question/answer pair (the title is not counted).
  num_tokens_q = num_tokens_from_string(question_text, encoder_name)
  num_tokens_a = num_tokens_from_string(question_answer, encoder_name)
  num_tokens_total = num_tokens_q + num_tokens_a

  rec = {
      "field": question_field,
      "title": question_header,
      "question": question_text,
      "answer": question_answer,
      "num_tokens_pair": num_tokens_total
  }
  print(f"Text extraction is successful for {question_header[:10]}")
  records.append(rec)

In [52]:
def get_all_text(urls, encoder_name, num_workers):
  with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [
        executor.submit(get_text, url, encoder_name) for url in urls
    ]

    for future in as_completed(futures):
      try:
        future.result()
      except Exception as e:
        print(f"Thread Error: {e}")
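One possible variation on this pattern is to have each worker return its record and to remember which URL each future belongs to, so failed pages can be listed and retried later. The sketch below is an alternative design, not the notebook's method: `fetch_record` is a hypothetical stand-in for `get_text` that does no network access, and `gather_records` and the sample paths are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_record(url: str) -> dict:
    """Hypothetical stand-in for get_text that returns its record."""
    if "bad" in url:
        raise ValueError(f"cannot parse {url}")
    return {"url": url, "length": len(url)}


def gather_records(urls, num_workers=4):
    """Run fetch_record concurrently; return (records, failures)."""
    collected, failed = [], []
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map each future back to its URL so failures can be reported.
        future_to_url = {executor.submit(fetch_record, u): u for u in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                collected.append(future.result())
            except Exception as exc:
                failed.append((url, str(exc)))
    return collected, failed


demo_records, demo_failed = gather_records(["/q/1", "/q/bad", "/q/2"])
print(len(demo_records), demo_failed)
```

Results arrive in completion order, so the records are not guaranteed to follow the order of the input URLs.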

In [None]:
encoder = "o200k_base"
get_all_text(all_urls, encoder, 10)

In [54]:
len(records)    # fetched all the records

1947

In [55]:
df = pd.DataFrame(data=records)

In [56]:
df.head()

| | field | title | question | answer | num_tokens_pair |
|---|---|---|---|---|---|
| 0 | Beyin & Sinir | 4. ventrikül normal büyüklük ve konfigürasyond... | 4. ventrikül normal büyüklük ve konfigürasyond... | Göndermiş olduğunuz beyin MR raporunda genel o... | 440 |
| 1 | Vertigo | Şiddetli Baş Dönmesi ve bacaklarda Hissizlik | Merhabalar öncelikle kolay gelsin. 21 yaşında ... | Yaşadığınız baş dönmesi ve bacaklarda hissizli... | 305 |
| 2 | Karaciğer | Eklem ağrılarım ve morarmaların yüzünden dokto... | iyi günler. Eklem ağrılarım ve morarmaların yü... | Paylaştığınız test sonuçlarına göre, ANA (Anti... | 161 |
| 3 | Gebelik (Hamilelik) | İlişkiye girdikten 24 saat sonra içilen Ella h... | Hocam merhaba. 1 Aralık 2023 tarihinde kız ark... | Verdiğiniz bu bilgilere göre kız arkadaşınızın... | 228 |
| 4 | Doğum Kontrol | Zevk Suyu Hamile Bırakır Mı? | Öncelikle merhaba. Nişanlımla 14 Aralık tarihi... | İlişki sırasında içeri boşalma olmaması hamile... | 354 |


In [63]:
field_counts = df["field"].value_counts()     # top 5 fields of questions asked
field_counts.head(5)

| field | count |
|---|---|
| Beyin & Sinir | 131 |
| Kadın Sağlığı | 99 |
| Gebelik (Hamilelik) | 87 |
| Deri Hastalıkları | 71 |
| Cinsel Sağlık | 67 |


In [60]:
df["num_tokens_pair"].sum()     # total number of tokens

552047