# Web Scraping with Python

- Beautiful Soup docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [2]:
pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.5
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install requests

Collecting requests
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Downloading charset_normalizer-3.3.2-cp310-cp310-win_amd64.whl.metadata (34 kB)
Collecting idna<4,>=2.5 (from requests)
  Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.1.0-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2023.11.17-py3-none-any.whl.metadata (2.2 kB)
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Downloading certifi

In [2]:
import requests
from bs4 import BeautifulSoup

In [7]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [8]:
soup = BeautifulSoup(html_doc, 'html.parser')  # Parse the HTML document with Beautiful Soup; this is the basic first step before extracting data from it.

In [9]:
type(soup)

bs4.BeautifulSoup

In [11]:
print(soup.prettify())  # Prints the document in a neatly indented form.

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [13]:
soup.find_all('a')  # Finds all <a> tags in the document and returns them as a list.

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
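
`find_all` also accepts filters, which helps when a page contains many `<a>` tags and only some are of interest. The following is a minimal sketch, assuming a trimmed copy of the example document above; the names `sample_html` and `sample_soup` are introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Small stand-in document, trimmed from the example above.
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
sisters = sample_soup.find_all("a", class_="sister")

# Filter by id; find returns the first match, or None if nothing matches.
lacie = sample_soup.find("a", id="link2")

# Cap the number of results.
first_two = sample_soup.find_all("a", limit=2)

# A comparable query written as a CSS selector.
tillie = sample_soup.select_one("a#link3")

print(len(sisters))                        # 3
print(lacie.get_text())                    # Lacie
print([a.get_text() for a in first_two])   # ['Elsie', 'Lacie']
print(tillie["href"])                      # http://example.com/tillie
```

Which form to use is largely a matter of taste: keyword filters read well for simple cases, while CSS selectors tend to be shorter for nested queries.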

In [19]:
# A more readable example:
for link in soup.find_all('a'):    # find_all returns every <a> tag in the document.
    link_id = link.get('id')       # Read the tag's id attribute (avoids shadowing the built-in id).
    href = link.get('href')        # Read the tag's href attribute.
    print(f"{link_id} : {href}")   # Combine each link's id and href and print them in a readable form.

    # "href" is short for "Hypertext Reference" and specifies the target of a link (a URL or another document).

link1 : http://example.com/elsie
link2 : http://example.com/lacie
link3 : http://example.com/tillie
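
On real pages some `<a>` tags have no `href`, and many links are relative. The loop above can be extended to handle both cases. This is only a sketch under assumptions: `base_url` and the sample markup are hypothetical, and `urljoin` from the standard library is one possible way to resolve relative links.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://example.com/"  # hypothetical base URL used to resolve relative links
sample_html = """
<a href="/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="link3">Tillie</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = []
for tag in sample_soup.find_all("a"):
    href = tag.get("href")  # get returns None when the attribute is missing
    if href is None:
        continue            # skip anchors that do not point anywhere
    links.append({
        "id": tag.get("id"),
        "url": urljoin(base_url, href),   # relative paths become absolute URLs
        "text": tag.get_text(strip=True),
    })

for entry in links:
    print(entry)
```

Collecting the results in a list of dictionaries, rather than printing inside the loop, makes it easier to pass them on to later steps such as writing a CSV file.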


In [20]:
print(soup.get_text())
# Extracts the text content of the BeautifulSoup object, that is, all the text in the HTML or XML document, and prints it.
# If you only care about the page text, this gives you the text with all tags stripped out.

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
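
The plain `get_text()` output keeps the whitespace and blank lines of the original markup. A minimal sketch of two ways to tidy it, assuming a small hypothetical document (`sample_html`): the `separator` and `strip` arguments of `get_text`, and the `stripped_strings` generator.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with stray whitespace between tags.
sample_html = "<div><h1> Title </h1>\n<p>First paragraph.</p>\n<p>Second paragraph.</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Strip each piece of text, drop the empty ones, and join the rest with a space.
clean = sample_soup.get_text(separator=" ", strip=True)

# Or keep the pieces separate as a list of stripped strings.
lines = list(sample_soup.stripped_strings)

print(clean)   # Title First paragraph. Second paragraph.
print(lines)   # ['Title', 'First paragraph.', 'Second paragraph.']
```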



## Get URL Example

In [3]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
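response = requests.get(url, timeout=10)  # a timeout keeps the call from hanging indefinitely
response.raise_for_status()               # raise an exception on HTTP errors such as 404 or 500
text = response.content.decode('utf-8')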

In [4]:

wiki_soup = BeautifulSoup(text, 'html.parser')

In [5]:
type(wiki_soup)

bs4.BeautifulSoup

In [6]:
wiki_soup.title

<title>Python (programming language) - Wikipedia</title>
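
`wiki_soup.title` returns the whole tag; the text itself is available through `.string`. Not every page has a `<title>`, so a small guard can help. The helper below, `page_title`, is a hypothetical name introduced for this sketch, and it works on a static string so that it runs without network access.

```python
from bs4 import BeautifulSoup

def page_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when no <title> tag exists; .string is None for an empty tag.
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()

sample_page = (
    "<html><head><title>Python (programming language) - Wikipedia</title></head>"
    "<body></body></html>"
)
print(page_title(sample_page))                                       # Python (programming language) - Wikipedia
print(page_title("<html><body><p>No title here</p></body></html>"))  # None
```

The same function could be applied to the `text` downloaded above, for example `page_title(text)`.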