<div id="BBox" class="alert alert-success" style="font-family:courier;color:black;justify-content:left;">
<h1>Introduction to Python Web Scraping</h1>
Beautiful Soup is a Python library for extracting data from HTML and XML files. It builds a parse tree that you can navigate and search to find the information you need. In this tutorial, we will explore the most commonly used Beautiful Soup methods, using the Wikipedia article on machine learning as the working example.
</div>

In [112]:
# !pip install beautifulsoup4
# !pip install requests

In [113]:
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime

# display is used below to render the section headers; import it explicitly alongside HTML
from IPython.display import HTML, display

In [114]:
website = '''<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>'''
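
The sample markup above can be parsed straight from the string, with no network request. The following is a minimal sketch of that step; it uses a condensed copy of the same markup, and the names `sample_html` and `sample_soup` are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

# Condensed copy of the sample document (illustrative; same tags and attributes)
sample_html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

# Build the parse tree with Python's built-in HTML parser
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_text = sample_soup.title.string                     # text inside <title>
title_parent = sample_soup.title.parent.name              # name of the enclosing tag
link_ids = [a["id"] for a in sample_soup.find_all("a")]   # ids of every <a> tag
third_link_href = sample_soup.find(id="link3")["href"]    # look up a tag by its id

print(title_text, title_parent, link_ids, third_link_href)
```

Working from a fixed string like this makes the results reproducible, whereas output scraped from a live page changes whenever the page does.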

In [115]:
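# Note: `soup` here is the Wikipedia page parsed in section 1 (cell In [117]),
# not the `website` sample string above, so run that cell first.
print(soup.title)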
print("----------------------------------")
print(soup.title.name)
print("----------------------------------")
print(soup.title.string)
print("----------------------------------")
print(soup.title.parent.name)
print("----------------------------------")
print(soup.p)
print("----------------------------------")
print(soup.a)
print("----------------------------------")
print(soup.find_all('a'))
print("----------------------------------")
print(soup.find(id="link3"))


<title>Machine learning - Wikipedia</title>
----------------------------------
title
----------------------------------
Machine learning - Wikipedia
----------------------------------
head
----------------------------------
<p><b>Machine learning</b> (<b>ML</b>) is a <a class="mw-redirect" href="/wiki/Field_of_study" title="Field of study">field of study</a> in <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned with the development and study of <a href="/wiki/Computational_statistics" title="Computational statistics">statistical algorithms</a> that can learn from <a href="/wiki/Data" title="Data">data</a> and <a class="mw-redirect" href="/wiki/Generalize" title="Generalize">generalize</a> to unseen data, and thus perform <a href="/wiki/Task_(computing)" title="Task (computing)">tasks</a> without explicit <a href="/wiki/Machine_code" title="Machine code">instructions</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note

In [116]:
html = """<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>1. Parsing the HTML</h2>
First, let’s load the HTML from a website:<br><br>
</div>"""

display(HTML(html))

In [117]:
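# Fetch the HTML content from the website
url = 'https://en.wikipedia.org/wiki/Machine_learning'
response = requests.get(url, timeout=10)  # give up rather than hang if the server is slow
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500

# Parse the HTML using Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')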


<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>2. find Method</h2>
<strong>What it does:</strong> <br>
The find method searches for the first occurrence of a tag that matches your criteria (e.g., tag name, attributes).<br><br>

Example:<br>
Let's find the first "h1" tag in the HTML of our webpage:
</div>

In [118]:
first_heading = soup.find('h1')
print(first_heading.text)

Machine learning
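
Besides a tag name, find also accepts attribute filters, and it returns None when nothing matches. The sketch below illustrates both points on a small made-up snippet (the markup and variable names are assumptions for illustration, not taken from the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Small illustrative document
html_doc = '<div><h1 id="main">Title</h1><p class="intro">Hello</p><p>World</p></div>'
doc = BeautifulSoup(html_doc, "html.parser")

intro = doc.find("p", class_="intro")   # first <p> whose class is "intro"
by_id = doc.find(id="main")             # first tag of any name with id="main"
missing = doc.find("table")             # no <table> in the snippet, so this is None

intro_text = intro.get_text()
heading_text = by_id.get_text()

# Guard against None before touching attributes, otherwise an AttributeError is raised
table_text = missing.get_text() if missing is not None else None

print(intro_text, heading_text, table_text)
```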


<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>3. find_all Method</h2>
<strong>What it does:</strong> <br>
The find_all method returns every element in the HTML that matches the given criteria, as a list-like result you can loop over.<br><br>

Example:<br>
To find all "p" tags on the page:
</div>

In [119]:
all_paragraphs = soup.find_all('p')

# Loop through and print each paragraph's text
for p in all_paragraphs:
    print(p.text)

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.[1] Quick progress in the field of deep learning, beginning in 2010s, allowed neural networks to surpass many previous approaches in performance.[2]

ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.[3][4] The application of ML to business problems is known as predictive analytics.

Statistics and mathematical optimization (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (EDA) via unsupervised learning.[6][7]

From a theoretical viewpoint, probably approximately correct (PAC) learning provides a framework for describing machine lea
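
find_all takes the same filters as find, plus a few extras. The sketch below, on a made-up snippet, shows filtering by class, capping the number of results with `limit`, and matching several tag names at once.

```python
from bs4 import BeautifulSoup

# Small illustrative document
html_doc = """
<h2>Heading</h2>
<p class="lead">First</p>
<p>Second</p>
<p class="lead note">Third</p>
"""
doc = BeautifulSoup(html_doc, "html.parser")

# class_ matches a tag if any one of its classes equals the given value
lead = [p.get_text() for p in doc.find_all("p", class_="lead")]

# limit stops the search after the given number of matches
first_two = [p.get_text() for p in doc.find_all("p", limit=2)]

# A list of names matches any of the listed tags, in document order
mixed = [tag.name for tag in doc.find_all(["h2", "p"])]

print(lead, first_two, mixed)
```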

<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>4. select Method</h2>
<strong>What it does:</strong> <br>
The select method takes a CSS selector and returns a list of all matching elements. It is often more flexible than find_all, because a single selector can combine tag names, classes, ids, attributes, and nesting.<br><br>

Example:<br>
    Find all elements with the class <strong>sidebar</strong>:
</div>

In [120]:
example_elements = soup.select('.sidebar')

# Print out the text of each element
for elem in example_elements:
    print(elem.text)

Part of a series onMachine learningand data mining
Paradigms
Supervised learning
Unsupervised learning
Semi-supervised learning
Self-supervised learning
Reinforcement learning
Meta-learning
Online learning
Batch learning
Curriculum learning
Rule-based learning
Neuro-symbolic AI
Neuromorphic engineering
Quantum machine learning

Problems
Classification
Generative modeling
Regression
Clustering
Dimensionality reduction
Density estimation
Anomaly detection
Data cleaning
AutoML
Association rules
Semantic analysis
Structured prediction
Feature engineering
Feature learning
Learning to rank
Grammar induction
Ontology learning
Multimodal learning

Supervised learning(classification • regression) 
Apprenticeship learning
Decision trees
Ensembles
Bagging
Boosting
Random forest
k-NN
Linear regression
Naive Bayes
Artificial neural networks
Logistic regression
Perceptron
Relevance vector machine (RVM)
Support vector machine (SVM)

Clustering
BIRCH
CURE
Hierarchical
k-means
Fuzzy
Expectation–maximiz
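
To make the selector syntax concrete, here is a small sketch on a made-up snippet (the markup is an assumption for illustration). It shows a class selector, a nested child selector, an attribute selector, and select_one, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Small illustrative document
html_doc = """
<div id="content">
  <ul class="menu">
    <li class="item active"><a href="/home">Home</a></li>
    <li class="item"><a href="/about">About</a></li>
  </ul>
  <p>Outside the list</p>
</div>
"""
doc = BeautifulSoup(html_doc, "html.parser")

# .item -> every element carrying the class "item"
by_class = [li.get_text(strip=True) for li in doc.select(".item")]

# id, class and direct-child combinators in one selector
nested_hrefs = [a["href"] for a in doc.select("#content ul.menu > li > a")]

# Attribute selector: <a> tags whose href is exactly "/about"
by_attr = doc.select('a[href="/about"]')

# select_one returns the first match (or None) instead of a list
first_active = doc.select_one("li.active")

print(by_class, nested_hrefs, by_attr[0].get_text(), first_active.get_text(strip=True))
```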

<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>5. get_text Method</h2>
<strong>What it does:</strong> <br>
The get_text method returns all the text inside a tag and its descendants as a single string, with the HTML tags themselves stripped out.<br><br>

Example:<br>
Extract the text content of a "div" tag:
</div>

In [121]:
div_text = soup.find('div').get_text() 
print(div_text)








Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload file



















Search











Search















Donate








Appearance
















Create account

Log in








Personal tools





 Create account Log in





		Pages for logged out editors learn more



ContributionsTalk
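
The output above is mostly blank lines, and adjacent items run together ("Main pageContents"), because get_text concatenates every text node as-is. A possible way to tidy this, sketched on a made-up snippet, is to pass the `separator` and `strip` arguments.

```python
from bs4 import BeautifulSoup

# Small illustrative document with stray whitespace between tags
html_doc = "<div>\n  <h3> Main menu </h3>\n  <ul><li>Main page</li><li>Contents</li></ul>\n</div>"
doc = BeautifulSoup(html_doc, "html.parser")
div = doc.find("div")

# Default behaviour: whitespace is kept and neighbouring strings are glued together
raw = div.get_text()

# strip=True trims each piece and drops empty ones; separator is placed between pieces
clean = div.get_text(separator=" ", strip=True)

print(repr(raw))
print(clean)
```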











<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>6. find_parent Method</h2>
<strong>What it does:</strong> <br>
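The find_parent method returns the closest ancestor of a given element. Called with no arguments, it returns the immediate parent; given a tag name or attributes, it returns the nearest ancestor that matches.<br><br>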

Example:<br>
Find the parent of the first anchor "a" tag:
</div>

In [122]:
first_link = soup.find('a')
parent = first_link.find_parent()
print(parent)  # use parent.text to print only the text

<body class="skin--responsive skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Machine_learning rootpage-Machine_learning skin-vector-2022 action-view"><a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<div class="vector-header-container">
<header class="vector-header mw-header">
<div class="vector-header-start">
<nav aria-label="Site" class="vector-main-menu-landmark">
<div class="vector-dropdown vector-main-menu-dropdown vector-button-flush-left vector-button-flush-right" id="vector-main-menu-dropdown">
<input aria-haspopup="true" aria-label="Main menu" class="vector-dropdown-checkbox" data-event-name="ui.dropdown-vector-main-menu-dropdown" id="vector-main-menu-dropdown-checkbox" role="button" type="checkbox"/>
<label aria-hidden="true" class="vector-dropdown-label cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" for="vector-main-menu-dropdow
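
Printing the immediate parent can dump a large part of the page, as above. When a specific ancestor is wanted, find_parent accepts a tag name, and find_parents returns the whole chain. The sketch below uses a made-up snippet for illustration.

```python
from bs4 import BeautifulSoup

# Small illustrative document: a link nested several levels deep
html_doc = '<body><div id="outer"><nav><ul><li><a href="#x">Jump</a></li></ul></nav></div></body>'
doc = BeautifulSoup(html_doc, "html.parser")
link = doc.find("a")

immediate = link.find_parent()         # the tag that directly contains the link
nearest_div = link.find_parent("div")  # walks upward until a <div> is found

# find_parents returns every ancestor, closest first
ancestors = [tag.name for tag in link.find_parents()]

print(immediate.name, nearest_div["id"], ancestors)
```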

<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>7. find_next_sibling Method</h2>
<strong>What it does:</strong> <br>
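The find_next_sibling method returns the next sibling of an element that is a tag, skipping over text nodes such as whitespace. You can pass a tag name or attributes to narrow the match.<br><br>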

Example:<br>
Get the next sibling of a specific tag (e.g., an "h1" tag):
</div>

In [123]:
first_heading = soup.find('h1')
next_sibling = first_heading.find_next_sibling()
print(next_sibling.text)




84 languages




Afrikaansالعربيةঅসমীয়াAzərbaycancaتۆرکجهবাংলা閩南語 / Bân-lâm-gúБашҡортсаБеларускаяभोजपुरीБългарскиབོད་ཡིགBosanskiCatalàČeštinaCymraegDanskالدارجةDeutschEestiΕλληνικάEspañolEuskaraفارسیFrançaisGaelgGalego한국어Հայերենहिन्दीBahasa IndonesiaIsiZuluÍslenskaItalianoעבריתಕನ್ನಡქართულიКыргызчаLatviešuLietuviųLigureMagyarМакедонскиമലയാളംमराठीBahasa MelayuМонголNederlands日本語Norsk bokmålNorsk nynorskOccitanଓଡ଼ିଆOʻzbekcha / ўзбекчаپنجابیپښتوPolskiPortuguêsQaraqalpaqshaRomânăRuna SimiРусскийᱥᱟᱱᱛᱟᱲᱤShqipSimple EnglishSlovenščinaکوردیСрпски / srpskiSrpskohrvatski / српскохрватскиSuomiSvenskaTagalogதமிழ்తెలుగుไทยTürkçeУкраїнськаاردوئۇيغۇرچە / UyghurcheTiếng ViệtVõro吴语粵語中文

Edit links
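
The difference between the method and the plain .next_sibling attribute is easy to miss, so here is a small sketch on a made-up snippet. The attribute returns whatever node comes next, often a whitespace string, while find_next_sibling looks for the next tag.

```python
from bs4 import BeautifulSoup

# Small illustrative document; note the newlines between the tags
html_doc = """<div>
<h1>Title</h1>
<p class="summary">Summary</p>
<ul><li>Item</li></ul>
<p>Body</p>
</div>"""
doc = BeautifulSoup(html_doc, "html.parser")
h1 = doc.find("h1")

raw_next = h1.next_sibling               # the newline text node right after </h1>
next_tag = h1.find_next_sibling()        # first following sibling that is a tag
next_list = h1.find_next_sibling("ul")   # first following <ul> sibling

# find_next_siblings (plural) collects every following sibling that matches
all_paras = [p.get_text() for p in h1.find_next_siblings("p")]

print(repr(raw_next), next_tag.name, next_list.name, all_paras)
```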





<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>8. find_previous_sibling Method</h2>
<strong>What it does:</strong> <br>
The find_previous_sibling method returns the closest preceding sibling of an element that is a tag, skipping over text nodes. It accepts the same filters as find_next_sibling.<br><br>
</div>

In [124]:
first_heading = soup.find('h1')
prev_sibling = first_heading.find_previous_sibling()
print(prev_sibling.text)





Toggle the table of contents
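
A short sketch of searching backward, again on a made-up snippet: it shows the closest previous tag, a filtered search, the None result when there is nothing before the element, and the ordering returned by find_previous_siblings.

```python
from bs4 import BeautifulSoup

# Small illustrative document
html_doc = '<div><nav id="toc">Contents</nav><span class="badge">New</span><h1>Title</h1><p>Body</p></div>'
doc = BeautifulSoup(html_doc, "html.parser")
h1 = doc.find("h1")

closest = h1.find_previous_sibling()           # nearest tag before <h1>
toc = h1.find_previous_sibling("nav")          # skip back until a <nav> is found
nothing = doc.find("nav").find_previous_sibling()  # <nav> is the first child, so None

# The plural form walks backward, so the closest sibling comes first
order = [tag.name for tag in h1.find_previous_siblings()]

print(closest.name, toc["id"], nothing, order)
```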









<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>9. attrs Attribute</h2>
<strong>What it does:</strong> <br>
The attrs attribute allows you to access the attributes of an HTML tag as a dictionary.

Example:<br>
Find the "href" attribute of an anchor "a" tag:
</div>

In [125]:
first_link = soup.find('a')
link_href = first_link.attrs['href']
print(link_href)

#bodyContent
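
Indexing attrs with a missing key raises a KeyError, which matters on real pages where some anchors have no href. The sketch below, on a made-up snippet, shows the full attrs dictionary, the shorthand tag["href"], and the safer get and has_attr calls.

```python
from bs4 import BeautifulSoup

# Small illustrative document: one link with attributes, one without
html_doc = '<a class="mw-jump-link external" href="#bodyContent" title="Jump">Jump to content</a><a>No href</a>'
doc = BeautifulSoup(html_doc, "html.parser")
first, second = doc.find_all("a")

all_attrs = first.attrs            # dictionary of every attribute on the tag
href = first["href"]               # shorthand for first.attrs["href"]
classes = first["class"]           # class is multi-valued, so it comes back as a list

missing_href = second.get("href")            # None instead of a KeyError
fallback_href = second.get("href", "")       # or supply a default value
has_title = first.has_attr("title")

print(all_attrs, href, classes, missing_href, repr(fallback_href), has_title)
```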


<div id='BBox' class='alert alert-info' style='font-family:courier;color:black;justify-content:left;'>
<h2>10. decompose Method</h2>
<strong>What it does:</strong> <br>
The decompose method removes a tag, along with everything inside it, from the tree and destroys it.<br><br>

Example:<br>
Let's remove the "h2" heading with the id "History" from the HTML:
</div>

In [126]:
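# Locate the <h2> heading whose id is "History"
heading_to_remove = soup.find('h2', {'id': 'History'})
if heading_to_remove is not None:  # find returns None when nothing matches
    heading_to_remove.decompose()

# The heading no longer exists in the parsed tree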
print(soup.text)





Machine learning - Wikipedia



























Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload file



















Search











Search















Donate








Appearance
















Create account

Log in








Personal tools





 Create account Log in





		Pages for logged out editors learn more



ContributionsTalk




























Contents
move to sidebar
hide




(Top)





1
History








2
Relationships to other fields




Toggle Relationships to other fields subsection





2.1
Artificial intelligence








2.2
Data compression








2.3
Data mining








2.4
Generalization








2.5
Statistics








2.6
Statistical physics










3
Theory








4
Approaches




Toggle Approaches subsection





4.1
Supervised learning








4.2
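
Because the full page text is long, the effect of decompose is hard to see in the output above. The sketch below repeats the idea on a tiny made-up snippet, and contrasts it with extract, which also removes a tag but hands it back so it can still be used.

```python
from bs4 import BeautifulSoup

# Small illustrative document
html_doc = '<div><h2 id="History">History</h2><p>Kept</p><span class="ad">Ad</span></div>'
doc = BeautifulSoup(html_doc, "html.parser")

# decompose: remove the heading and discard it
heading = doc.find("h2", {"id": "History"})
if heading is not None:
    heading.decompose()

# extract: remove the tag from the tree but keep a reference to it
removed = doc.find("span", class_="ad").extract()

remaining = str(doc)
print(remaining)
print(removed.get_text())
```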

In [127]:
# Find all Anchor tags
for link in soup.find_all('a'):
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Machine+learning
/w/index.php?title=Special:UserLogin&returnto=Machine+learning
/w/index.php?title=Special:CreateAccount&returnto=Machine+learning
/w/index.php?title=Special:UserLogin&returnto=Machine+learning
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#History
#Relationships_to_other_fields
#Artificial_intelligence
#Data_compression
#Data_mining
#Generalization
#Statistics
#Statistical_physics
#Theory
#Approaches
#Super

https://search.worldcat.org/issn/1532-298X
/wiki/PMC_(identifier)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3203449
/wiki/PMID_(identifier)
https://pubmed.ncbi.nlm.nih.gov/21896882
#cite_ref-mining_80-0
/wiki/CiteSeerX_(identifier)
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.40.6984
/wiki/Doi_(identifier)
https://doi.org/10.1145%2F170035.170072
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/978-0897915922
/wiki/S2CID_(identifier)
https://api.semanticscholar.org/CorpusID:490415
#cite_ref-81
https://doi.org/10.1155%2F2009%2F736398
/wiki/Doi_(identifier)
https://doi.org/10.1155%2F2009%2F736398
/wiki/ISSN_(identifier)
https://search.worldcat.org/issn/1687-6229
#cite_ref-82
https://www.era.lib.ed.ac.uk/bitstream/handle/1842/6656/Plotkin1972.pdf;sequence=1
https://web.archive.org/web/20171222051034/https://www.era.lib.ed.ac.uk/bitstream/handle/1842/6656/Plotkin1972.pdf;sequence=1
/wiki/Wayback_Machine
#cite_ref-83
http://ftp.cs.yale.edu/publications/techreports/tr192.pdf
h

/wiki/Alan_Mackworth
https://archive.org/details/computationalint00pool
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/978-0-19-510270-3
https://web.archive.org/web/20200726131436/https://archive.org/details/computationalint00pool
/wiki/Stuart_J._Russell
/wiki/Peter_Norvig
http://aima.cs.berkeley.edu/
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/0-13-790395-2
/w/index.php?title=Machine_learning&action=edit&section=54
https://ai.stanford.edu/people/nilsson/mlbook.html
https://web.archive.org/web/20190816182600/http://ai.stanford.edu/people/nilsson/mlbook.html
/wiki/Wayback_Machine
/wiki/Trevor_Hastie
/wiki/Robert_Tibshirani
/wiki/Jerome_H._Friedman
https://web.stanford.edu/~hastie/ElemStatLearn/
https://web.archive.org/web/20131027220938/http://www-stat.stanford.edu/%7Etibs/ElemStatLearn//
/wiki/Wayback_Machine
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/0-387-95284-5
/wiki/Pedro_Domingos
/wiki/The_Master_Algorithm
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/978-0-46

/wiki/Template:Differentiable_computing
/wiki/Health_informatics
/wiki/Digital_art
/wiki/Electronic_publishing
/wiki/Cyberwarfare
/wiki/Electronic_voting
/wiki/Video_game
/wiki/Word_processor
/wiki/Operations_research
/wiki/Educational_technology
/wiki/Document_management_system
/wiki/Category:Computer_science
/wiki/Outline_of_computer_science
/wiki/Template:Glossaries_of_computers
/wiki/Help:Authority_control
https://www.wikidata.org/wiki/Q2539#identifiers
https://d-nb.info/gnd/4193754-5
https://id.loc.gov/authorities/sh85079324
https://id.ndl.go.jp/auth/ndlna/001210569
https://aleph.nkp.cz/F/?func=find-c&local_base=aut&ccl_term=ica=ph126143&CON_LNG=ENG
http://olduli.nli.org.il/F/?func=find-b&local_base=NLX10&find_code=UID&request=987007541156405171
https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=1252479637
/wiki/Help:Category
/wiki/Category:Machine_learning
/wiki/Category:Cybernetics
/wiki/Category:Learning
/wiki/Category:Webarchive_template_wayback_links
/wiki/Cate
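
The list above mixes in-page anchors (#bodyContent), site-relative paths (/wiki/...), protocol-relative URLs (//en.wikipedia.org/...), and absolute URLs. One possible way to normalize them is the standard library's urljoin, sketched here on a made-up snippet that mimics those four cases; the choice to skip in-page anchors is an assumption about what is wanted.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Machine_learning"

# Illustrative links covering the kinds of href values seen in the output above
html_doc = (
    '<a href="#bodyContent">Jump</a>'
    '<a href="/wiki/Main_Page">Main</a>'
    '<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">Contact</a>'
    '<a href="https://donate.wikimedia.org/">Donate</a>'
    '<a>no href</a>'
)
doc = BeautifulSoup(html_doc, "html.parser")

absolute_links = []
for link in doc.find_all("a", href=True):   # href=True skips anchors without an href
    href = link["href"]
    if href.startswith("#"):                # in-page anchor, not a separate page
        continue
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```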

<div id='BBox' class='alert alert-danger' style='font-family:courier;color:black;justify-content:left;'>
<h2>Conclusion</h2>
This tutorial introduced some of the most useful methods in Beautiful Soup, from basic tag searching (find and find_all) to more advanced DOM traversal methods (find_next_sibling, find_parent). Beautiful Soup is a powerful tool for web scraping, and these methods will help you extract and manipulate HTML content efficiently.
</div>