# Beautiful Soup Tutorial - Web Scraping in Python

Notes by Marcus Mariano, written to improve skills while following the tutorial 'Beautiful Soup Tutorial - Web Scraping in Python' (freeCodeCamp.org).

**For more information about Marcus Mariano: [Web site](https://marcusmariano.github.io/mmariano/)**  

**Beautiful Soup Tutorial - Web Scraping in Python [here.](https://www.youtube.com/watch?v=87Gx3U0BDlo&t=219s)** 

In [1]:
from requests import get
from bs4 import BeautifulSoup as bs4

In [4]:
# The "get" function from the requests module sends an
# HTTP GET request to the URL passed as its argument and
# returns a Response object:
result = get("https://www.google.com/")

In [5]:
# To confirm that the request succeeded, check the status
# code. A 200 OK response indicates that the server
# returned the page:
print(result.status_code)

200
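
The status check above only prints a number. As a minimal sketch of how such a code could be interpreted, the snippet below uses the standard library's `http.HTTPStatus` enum. The helper names `describe_status` and `is_success` are hypothetical and not part of the original tutorial.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, if it is a known code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """Treat any 2xx code as success (hypothetical helper for illustration)."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
print(is_success(200))
```

With requests itself, calling `result.raise_for_status()` is another option: it raises an exception when the response carries a 4xx or 5xx status code.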


In [6]:
# For other potential status codes you may encounter,
# consult the following Wikipedia page:
# https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

# We can also inspect the HTTP response headers, which
# describe the response (content type, server, cookies, etc.):
print(result.headers)

{'Date': 'Sat, 02 May 2020 13:15:48 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2020-05-02-13; expires=Mon, 01-Jun-2020 13:15:48 GMT; path=/; domain=.google.com; Secure, NID=203=SDM2iz7jeVatq-lEvEa7tElkWIYNaGTTGJq2YCXqrnvNezLiI4pdwL0WlQhBIvQ8mqr44PpsIzdzfrfosu1rd8dCfb3IfqqQITcLKzIKXo30NC3cFjzsbVltv6AQfsTJqZ05i56xyyyiCyn9RPMiMdWSnhplBOfaItZWB_ziHfw; expires=Sun, 01-Nov-2020 13:15:48 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'Transfer-Encoding': 'chunked'}
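
Individual header values are often more useful than the whole dictionary. The sketch below parses a `Content-Type` value like the one in the output above, using only the standard library. The helper name `parse_content_type` is an assumption made for illustration.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_type(), msg.get_content_charset()


# Sample value copied from the header output shown above.
media_type, charset = parse_content_type("text/html; charset=ISO-8859-1")
print(media_type)
print(charset)
```

In requests, `result.headers` is a case-insensitive dictionary, so `result.headers["content-type"]` and `result.headers["Content-Type"]` return the same value.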


In [7]:
# For more information on HTTP headers and the information
# one can obtain from them, you may consult:
# https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

# Now, let us store the body of the response in a variable.
# The "content" attribute holds the raw page source as bytes:
src = result.content

In [9]:
# Now that we have the page source stored, we will use
# BeautifulSoup to parse and process it. To do so, we create
# a BeautifulSoup object from the source variable above.
# The 'lxml' parser requires the lxml package to be installed:
soup = bs4(src, 'lxml')
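
To see what the parsed object offers without making a network request, here is a minimal sketch on a small inline document. The HTML string and the names `html_doc` and `soup_demo` are made up for illustration. It uses Python's built-in `"html.parser"` instead of `'lxml'`, so no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from the page fetched above.
html_doc = (
    "<html><head><title>Example page</title></head>"
    '<body><p class="intro">Hello, <b>world</b>!</p></body></html>'
)

# "html.parser" ships with Python, so it works without installing lxml.
soup_demo = BeautifulSoup(html_doc, "html.parser")

page_title = soup_demo.title.string  # text inside <title>
intro_text = soup_demo.find("p", class_="intro").get_text()  # text of the paragraph

print(page_title)
print(intro_text)
```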

In [12]:
soup

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="pt-BR"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/><title>Google</title><script nonce="O90sBvtV/ayDJmUPH35jzA==">(function(){window.google={kEI:'hHKtXq_0KtDO5OUPlNWZ2AQ',kEXPI:'0,202123,3,4,1151616,5663,731,223,5105,206,1245,1959,10,1051,175,364,1435,4,60,576,127,114,383,246,5,860,494,377,176,260,25,200,327,329,58,1160,406,571,1123877,1197767,382,78,329040,1294,12383,4855,32691,15248,867,28684,369,8819,8384,1326,3532,1362,9290,3021,4747,3121,7912,1808,4020,978,7931,5297,2974,873,38,1179,1714,1,7690,11306,3222,4516,1397,1383,517,400,2277,8,2796,1593,1167,112,2212,532,147,561,542,840,517,1142,278,50,52,158,4100,312,1137,2,2063,606,1839,184,544,1233,522,1945,245,502,1482,93,328,1284,16,2927,2246,474,1339,748,1039,2269,1,957,773,1217,855,7,6068,8078,2662,642,605,802,1043,2458,1226,1462,3934

In [11]:
# Now that the page source has been parsed by BeautifulSoup,
# we can access specific information directly from it. For instance,
# say we want to see a list of all of the links on the page:
links = soup.find_all("a")
print(links)
print("\n")

[<a class="gb1" href="https://www.google.com.br/imghp?hl=pt-BR&amp;tab=wi">Imagens</a>, <a class="gb1" href="https://maps.google.com.br/maps?hl=pt-BR&amp;tab=wl">Maps</a>, <a class="gb1" href="https://play.google.com/?hl=pt-BR&amp;tab=w8">Play</a>, <a class="gb1" href="https://www.youtube.com/?gl=BR&amp;tab=w1">YouTube</a>, <a class="gb1" href="https://news.google.com.br/nwshp?hl=pt-BR&amp;tab=wn">Notícias</a>, <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>, <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>, <a class="gb1" href="https://www.google.com.br/intl/pt-BR/about/products?tab=wh" style="text-decoration:none"><u>Mais</u> »</a>, <a class="gb4" href="http://www.google.com.br/history/optout?hl=pt-BR">Histórico da Web</a>, <a class="gb4" href="/preferences?hl=pt-BR">Configurações</a>, <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=pt-BR&amp;passive=true&amp;continue=https://www.google.com/" id="gb_70" target="_top">Fazer login</
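
The list above contains whole `<a>` tags, some with relative `href` values such as `/preferences?hl=pt-BR`. As a sketch under stated assumptions, the snippet below collects only the link targets and resolves relative ones against a base URL with `urllib.parse.urljoin`. The inline HTML is a shortened, hand-written sample, and `base_url`, `html_doc`, and `soup_demo` are names introduced for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.google.com/"

# Shortened sample modeled on the output above; the last tag has no href.
html_doc = (
    '<a class="gb4" href="/preferences?hl=pt-BR">Configurações</a>'
    '<a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>'
    '<a name="anchor-only">No target</a>'
)
soup_demo = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [urljoin(base_url, a["href"]) for a in soup_demo.find_all("a", href=True)]

for href in hrefs:
    print(href)
```

Filtering with `href=True` avoids a `KeyError` on anchors that have no `href` attribute.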

In [16]:
# Perhaps we just want to extract the link that contains the text
# "About" instead of every link. Because the page was served in
# Brazilian Portuguese, we search for "Sobre" ("About"). We can use
# the "text" attribute to access the text content between the
# <a> </a> tags.
for link in links:
    if "Sobre" in link.text:
        print(link)
        print(link.attrs['href'])

<a href="/intl/pt-BR/about.html">Sobre o Google</a>
/intl/pt-BR/about.html
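
The loop above can be wrapped in a reusable function. This is only a sketch: `find_links_by_text` is a hypothetical helper, the inline HTML is a hand-written sample, and the case-insensitive matching is an added design choice rather than something the tutorial does.

```python
from bs4 import BeautifulSoup


def find_links_by_text(soup, needle):
    """Return (text, href) pairs for <a> tags whose text contains needle."""
    needle = needle.lower()
    return [
        (a.get_text(strip=True), a.get("href"))  # .get() returns None if href is missing
        for a in soup.find_all("a")
        if needle in a.get_text().lower()
    ]


# Hand-written sample; the first link mirrors the output shown above.
html_doc = (
    '<a href="/intl/pt-BR/about.html">Sobre o Google</a>'
    '<a href="/preferences?hl=pt-BR">Configurações</a>'
)
soup_demo = BeautifulSoup(html_doc, "html.parser")

matches = find_links_by_text(soup_demo, "sobre")
print(matches)
```

Using `a.get("href")` instead of `a.attrs['href']` means a matching tag without an `href` yields `None` rather than raising a `KeyError`.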
