**Credit:** This notebook was originally written by [Original Author](https://github.com/walid3271).  
This version is kept here for learning/reference purposes only.

# `Web Scraping`


# Introduction to web scraping with BeautifulSoup
* Web scraping is the automated extraction of data from websites.
* It is useful for collecting information from web pages programmatically instead of copying it by hand.
* BeautifulSoup, from the `bs4` package, is a popular Python library for web scraping.
* It simplifies parsing HTML and XML documents so that the desired elements can be extracted.

* Reference: [brightdata](https://brightdata.com/blog/how-tos/beautiful-soup-web-scraping)
* User Agent: [whatismybrowser](https://www.whatismybrowser.com/detect/what-is-my-user-agent/)
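
To make the points above concrete, here is a minimal sketch that parses a small inline HTML string, so it runs without any network access. The HTML snippet and the variable names are made up for illustration; `html.parser` is the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

page_title = demo_soup.title.get_text()       # text inside <title>
heading_id = demo_soup.find("h1")["id"]       # attributes are read like dict keys
paragraphs = [p.get_text() for p in demo_soup.find_all("p")]

print(page_title)
print(heading_id)
print(paragraphs)
```

The same calls work on HTML downloaded from a real page; only the source of the string changes.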

Installation and Declaration

In [9]:
# Uncomment to install the packages used in this notebook
# %pip install beautifulsoup4 requests pandas

Fetching Web Content

In [10]:
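# Fetching web content
import requests

url = "https://walid.vercel.app"
# A browser-like User-Agent string sent with the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
# Use the headers= keyword: the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
print(response)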

<Response [200]>
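
One detail worth checking when sending a custom User-Agent: the second positional parameter of `requests.get` is `params`, so a dictionary passed positionally ends up in the query string rather than in the request headers. The sketch below builds two requests without sending them and compares the result. The URL and the User-Agent value are placeholders, not taken from the notebook.

```python
import requests

demo_url = "https://example.com"                  # placeholder URL
demo_headers = {"User-Agent": "demo-agent/1.0"}   # made-up User-Agent value

# Mirrors requests.get(url, headers): the dict fills `params`
# and is encoded into the query string.
as_params = requests.Request("GET", demo_url, params=demo_headers).prepare()

# Mirrors requests.get(url, headers=headers): the dict is sent as HTTP headers.
as_headers = requests.Request("GET", demo_url, headers=demo_headers).prepare()

print(as_params.url)                      # query string contains User-Agent=...
print(as_headers.url)                     # URL is left untouched
print(as_headers.headers["User-Agent"])   # header is set as intended
```

For a real fetch, it is also common to pass `timeout=` to `requests.get` and to call `response.raise_for_status()` so that HTTP errors are not silently parsed as content.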


In [11]:
# Parsing the HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

Basic Operations with BeautifulSoup

In [12]:
# Find the first <title> tag
title = soup.find("title")
print(title)           # the whole tag
print(title.text)      # the text inside the tag
print(title.contents)  # list of the tag's direct children

<title>Walid</title>
Walid
['Walid']
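
For a tag that holds only text, `.text` and `.contents` look almost the same. The difference shows up with nested tags, as in this small sketch on a made-up snippet (the names `demo_soup` and `p` are illustrative only).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a nested <b> tag.
demo_soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = demo_soup.find("p")

print(p.text)       # all text in the tag and its descendants, concatenated
print(p.contents)   # direct children: a string followed by a <b> tag
print(p.string)     # None, because <p> has more than one child
```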


In [13]:
# All occurrences of a tag
links = soup.find_all("a")
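
`find_all` returns a list-like collection of tags, and each tag exposes its attributes. The following sketch, on made-up markup, shows a few common ways to narrow the result; the class name `ext` is purely illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation markup for illustration.
html = """
<nav>
  <a href="https://example.com/a" class="ext">A</a>
  <a href="/local">Local</a>
  <a>No href</a>
</nav>
"""
nav_soup = BeautifulSoup(html, "html.parser")

anchors = nav_soup.find_all("a")                  # every <a> tag
hrefs = [a.get("href") for a in anchors]          # .get returns None if the attribute is missing
with_href = nav_soup.find_all("a", href=True)     # only tags that have an href
ext_only = nav_soup.find_all("a", class_="ext")   # filter by CSS class

print(hrefs)
print(len(with_href), len(ext_only))
```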

**Practical exercises**

In [14]:
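import re

# Collect every https URL found in the string form of each <a> tag.
# Note: the pattern allows only one dot in the host name, so hosts with more
# labels are cut short (e.g. "https://walid.vercel" for "https://walid.vercel.app").
url_pattern = r'https:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[a-zA-Z0-9._~:/?#@!$&\'()*+,;=%-]*)?'
lst = []
for link in links:
    result = re.findall(url_pattern, str(link))
    if result:
        lst.append(result)  # one list of matches per tag
print(lst)

# Join each tag's matches into a single string
new_lst = []
for matches in lst:
    new_lst.append(''.join(matches))
print(new_lst)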

[['https://www.linkedin.com/in/munsiwalidalhassannizhu'], ['https://www.facebook.com/whalidmunshi'], ['https://huggingface.co/WalidAlHassan'], ['https://github.com/walid3271'], ['https://huggingface.co/WalidAlHassan/Face-Detection-Using-URL'], ['https://huggingface.co/WalidAlHassan/Floor-Object-Rooms-and-Bed-direction-Identification-according-to-Vastu-angle'], ['https://huggingface.co/WalidAlHassan/GMP_Face_Authentication'], ['https://huggingface.co/WalidAlHassan/Find-Direction-Of-A-Bolt'], ['https://huggingface.co/WalidAlHassan/Virtual-Mouse'], ['https://huggingface.co/WalidAlHassan/ChatBot'], ['https://chatbot-with-gemini.streamlit'], ['https://huggingface.co/WalidAlHassan/ChatBot-Gemini'], ['https://huggingface.co/WalidAlHassan/Romero-ChatBot'], ['https://huggingface.co/WalidAlHassan/SCREW-APP'], ['https://huggingface.co/WalidAlHassan/Conveyor-Belt-Screw-Count'], ['https://walid.vercel'], ['https://www.linkedin.com/in/munsiwalidalhassannizhu'], ['https://www.facebook.com/whalidmunsh
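
An alternative to matching URLs with a regular expression is to read each anchor's `href` attribute directly, which returns the full value no matter how many dots the host name has. This is a sketch under assumptions: the markup below is made up, and filtering by the `https` scheme with `urllib.parse.urlparse` is one possible choice, not the notebook's original method.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Hypothetical anchors, including a duplicate, a mailto link and a relative link.
html = """
<a href="https://demo.example.app">Site</a>
<a href="https://example.com/profile">Profile</a>
<a href="https://example.com/profile">Profile again</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="/about">About</a>
"""
link_soup = BeautifulSoup(html, "html.parser")

https_links = []
for a in link_soup.find_all("a", href=True):
    href = a["href"]
    if urlparse(href).scheme == "https":   # keep absolute https links only
        https_links.append(href)

# dict.fromkeys removes duplicates while keeping first-seen order.
unique_links = list(dict.fromkeys(https_links))
print(unique_links)
```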

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://quotes.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
# Use the headers= keyword: the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
# print(response.content)
print('Status: ',response)
soup = BeautifulSoup(response.text, "html.parser")

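# Quote text is in <span class="text">, author names in <small class="author">
quotes = soup.find_all("span", attrs={"class":"text"})
authors = soup.find_all("small", attrs={"class":"author"})

# Pair each quote with its author; zip assumes both lists are in the same order
qu = []
for quote, author in zip(quotes, authors):
    qu.append({"Quote": quote.text, "Author": author.text})

# Save the rows to a CSV file without the index column
csv_file = "Quote.csv"
df = pd.DataFrame(qu)
df.to_csv(csv_file, index=False, encoding="utf-8")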

Status:  <Response [200]>
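
Pairing two separate `find_all` results with `zip` relies on both lists having the same length and order. A more defensive sketch is to loop over each quote container and look up the text and author inside it, so a missing field cannot shift the pairing. The markup below is a simplified, made-up stand-in, and the `div` with class `quote` as the container is an assumption about the page layout.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup; the second block has no author on purpose.
html = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Quote with no author.</span>
</div>
<div class="quote">
  <span class="text">Third quote.</span>
  <small class="author">Author Three</small>
</div>
"""
quote_soup = BeautifulSoup(html, "html.parser")

rows = []
for block in quote_soup.find_all("div", class_="quote"):
    text_tag = block.find("span", class_="text")
    author_tag = block.find("small", class_="author")
    rows.append({
        "Quote": text_tag.get_text(strip=True) if text_tag else None,
        "Author": author_tag.get_text(strip=True) if author_tag else None,
    })

print(rows)
```

With the `zip` approach, the third author here would have been paired with the second quote; the per-container loop records `None` instead. The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as before.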


In [16]:
df

|   | Quote | Author |
|---|-------|--------|
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein |
| 6 | “It is better to be hated for what you are tha... | André Gide |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin |
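
The effect of `index=False` when saving can be checked without touching the disk. This sketch writes a small, made-up DataFrame to an in-memory buffer and reads it back; the rows and variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical rows standing in for scraped quotes.
demo_rows = [
    {"Quote": "A short quote.", "Author": "Someone"},
    {"Quote": "Another quote.", "Author": "Someone Else"},
]
df_demo = pd.DataFrame(demo_rows)

# Without the index: the CSV header holds only the real columns.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

# With the index: reading the file back yields an extra "Unnamed: 0" column.
buffer_idx = io.StringIO()
df_demo.to_csv(buffer_idx)
with_index = pd.read_csv(io.StringIO(buffer_idx.getvalue()))

print(list(round_trip.columns))
print(list(with_index.columns))
```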
