# Obtaining text for Natural Language Processing (NLP) Tasks

Beautiful Soup is a Python library for extracting information from web pages by parsing HTML and XML documents. This step-by-step tutorial shows how to use Beautiful Soup for web scraping.

https://beautiful-soup-4.readthedocs.io/en/latest/

Step 1: Install Necessary Libraries: Install the requests and beautifulsoup4 packages (uncomment the line below if they are not already installed).

In [1]:
# !pip install requests beautifulsoup4

Step 2: Import Libraries

In [2]:
import requests
from bs4 import BeautifulSoup

Step 3: Make an HTTP Request: Choose a website you want to scrape and send a GET request to it. For this example, let's scrape Google's homepage.

In [3]:
url = 'https://google.com'
response = requests.get(url)

Step 4: Parse the HTML Content - Once you have the HTML content, you can use Beautiful Soup to parse it:

In [4]:
soup = BeautifulSoup(response.text, 'html.parser')

Step 5: Extract Data from the HTML. Let's say you want to extract the text of every `<div>` element on the page:

In [5]:
divs = soup.find_all('div')
for div in divs:
    print(div.text.strip())

Search Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in
Search Images Maps Play YouTube News Gmail Drive More »
Web History | Settings | Sign in





Google offered in:  Afrikaans Sesotho isiZulu IsiXhosa Setswana Northern Sotho
Google offered in:  Afrikaans Sesotho isiZulu IsiXhosa Setswana Northern Sotho
Google offered in:  Afrikaans Sesotho isiZulu IsiXhosa Setswana Northern Sotho
AdvertisingBusiness SolutionsAbout GoogleGoogle.co.za
AdvertisingBusiness SolutionsAbout GoogleGoogle.co.za
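
The repeated lines in the output above appear because nested `<div>` elements each report the text of everything inside them. If the goal is headings specifically, a minimal sketch like the one below passes a list of tag names to find_all(). The HTML here is a made-up sample (not Google's homepage) so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Site title</h1>
  <div><h2>Section one</h2><p>Body text.</p></div>
  <div><h3>  Subsection  </h3></div>
</body></html>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Passing a list matches any of the named tags, in document order.
heading_tags = sample_soup.find_all(["h1", "h2", "h3"])

# strip=True trims the whitespace around each heading's text.
heading_texts = [tag.get_text(strip=True) for tag in heading_tags]
print(heading_texts)
```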


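Step 6: Handle Errors: Check the response's status code before relying on the content. A status code of 200 means the request succeeded.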

In [6]:
if response.status_code == 200:
    # Proceed with scraping
    print("Scraping web page")
else:
    print("Failed to retrieve the web page")

Scraping web page
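
The status check above can also be folded into a small helper. The sketch below is one possible approach, not the only one: `fetch_html` is a hypothetical name, and the 10-second timeout is an arbitrary assumption. It relies on `raise_for_status()`, which raises an exception for 4xx and 5xx responses, and on `requests.exceptions.RequestException`, the base class for errors raised by requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    `fetch_html` and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors.
        print(f"Failed to retrieve {url}: {error}")
        return None
    return response.text
```

A call such as `html = fetch_html('https://google.com')` then returns either the HTML text or `None`, so the parsing step can be skipped when the download fails.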


### Pulling Headlines from a News Site

Once Beautiful Soup is installed, you can pull a web page into a Jupyter notebook from its URL. Note that Beautiful Soup cannot fetch the page itself. You first need to get the document behind the web page with requests, and then pass that document to Beautiful Soup, using the response's content attribute, along with an HTML parser.

Choose a website you want to scrape and send a GET request.

In [7]:
url = 'https://www.theguardian.com/international'
news_get_request = requests.get(url)

Once you have the HTML content, you can use Beautiful Soup to parse it:

In [8]:
news = BeautifulSoup(news_get_request.content, 'html.parser')

Use your browser's inspector to look at the HTML behind the web page. The inspector goes by slightly different names depending on the browser; the keyboard shortcut is Ctrl-Shift-I in Windows-based browsers and Command-Option-I in Safari. Suppose the inspector shows that all headlines on the home page sit inside an `<h3>` tag.

Let’s take a look at the first three of them using find_all() (findAll() is an older alias for the same method).

In [9]:
headlines = news.find_all('h3')
headlines[0:3]

[<h3 class="dcr-1sp44ag"><div class="dcr-v1s16m">France</div><span class="show-underline dcr-1ay6c8s">Leftwing parties form ‘Popular Front’ to contest snap election</span></h3>,
 <h3 class="dcr-10yfzki"><div class="dcr-v1s16m">G7 summit</div><span class="show-underline dcr-1ay6c8s">Joe Biden says ‘democracies can deliver’ as G7 agree $50bn Ukraine aid deal</span></h3>,
 <h3 class="dcr-1iq5aaq"><a class="dcr-1bsckv0" href="/politics/article/2024/jun/13/rishi-sunak-denies-he-is-being-snubbed-after-awkward-start-to-g7-summit"><div class="dcr-17kznwa">UK</div><span class="show-underline dcr-1ay6c8s">Sunak denies he is being snubbed after awkward G7 start</span></a></h3>]
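
The third result above wraps its text in an `<a>` tag with a relative href. As a rough sketch, the link can be turned into an absolute URL with `urljoin` from the standard library. The sample HTML below is copied from the output above, and `BASE_URL` is assumed to be the page that was requested; headlines without an `<a>` inside the `<h3>` are simply skipped.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base: the URL the page was fetched from.
BASE_URL = "https://www.theguardian.com/international"

# Two of the <h3> elements shown in the output above.
SAMPLE_HTML = """
<h3 class="dcr-1sp44ag"><div class="dcr-v1s16m">France</div><span class="show-underline dcr-1ay6c8s">Leftwing parties form ‘Popular Front’ to contest snap election</span></h3>
<h3 class="dcr-1iq5aaq"><a class="dcr-1bsckv0" href="/politics/article/2024/jun/13/rishi-sunak-denies-he-is-being-snubbed-after-awkward-start-to-g7-summit"><div class="dcr-17kznwa">UK</div><span class="show-underline dcr-1ay6c8s">Sunak denies he is being snubbed after awkward G7 start</span></a></h3>
"""

sample = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = []
for h3 in sample.find_all("h3"):
    # href=True only matches <a> tags that actually carry an href attribute.
    anchor = h3.find("a", href=True)
    if anchor is not None:
        # Resolve the relative path against the page URL.
        links.append(urljoin(BASE_URL, anchor["href"]))

print(links)
```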

The result includes all the HTML tags and CSS class names along with the headlines, when we only want the headline text.

We can isolate the text with Beautiful Soup’s text attribute (or its get_text() method). Here, we use a list comprehension to build a list of all the headlines on the home page, and then display the first four.

In [10]:
all_headlines = [headline.text for headline in headlines]
all_headlines[0:4]

['FranceLeftwing parties form ‘Popular Front’ to contest snap election',
 'G7 summitJoe\xa0Biden says ‘democracies can deliver’ as G7 agree $50bn Ukraine aid deal',
 'UKSunak denies he is being snubbed after awkward G7 start',
 'Middle East liveHamas official says nobody knows how many of the remaining hostages are alive']
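
In the output above, the section label runs straight into the headline ('FranceLeftwing ...') and a non-breaking space shows up as `\xa0`. One way to tidy this, sketched below, is to call get_text() with a separator and then collapse all whitespace. `clean_headline` is a hypothetical helper name, and the sample HTML is a simplified version of the `<h3>` elements shown earlier, with class attributes omitted.

```python
from bs4 import BeautifulSoup

# Simplified copies of two headlines from the page (class attributes omitted).
SAMPLE_HTML = (
    "<h3><div>France</div><span>Leftwing parties form ‘Popular Front’ "
    "to contest snap election</span></h3>"
    "<h3><div>G7 summit</div><span>Joe&nbsp;Biden says ‘democracies can deliver’ "
    "as G7 agree $50bn Ukraine aid deal</span></h3>"
)


def clean_headline(tag):
    """Join the text pieces inside a tag with spaces and normalize whitespace."""
    # separator=" " puts a space between the label and the headline text.
    text = tag.get_text(separator=" ", strip=True)
    # str.split() with no arguments also splits on non-breaking spaces (\xa0).
    return " ".join(text.split())


sample = BeautifulSoup(SAMPLE_HTML, "html.parser")
clean_headlines = [clean_headline(h3) for h3 in sample.find_all("h3")]
print(clean_headlines)
```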

We have now collected the day's headlines from this news website as a list of Python strings.