# Learn Python for SEO
Author: Alex Galea   
Date: November 2016

In [1]:
import sys
print('Python version: %s.%s' % (sys.version_info.major, sys.version_info.minor))

Python version: 3.5


___
## Tutorial 5: Scraping a Webpage

There are many good ways of pulling data from webpages with Python. We'll use the `requests` library to download the HTML and then BeautifulSoup4 to parse it. This tutorial covers:

 - Instantiate BeautifulSoup object with webpage data
 - Get page title and meta-description
 - Get headings and other text from the page

### From Webpage to Python Object

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = 'http://www.fullbeauty.com'
page_fb = requests.get(url)
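
`requests.get` returns a response object even when the server answers with an error code, so it can help to check the status before parsing. The following is a minimal sketch of that check; the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, or raise on an HTTP error.

    The function name and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

When the request succeeds, `fetch_html(url)` should return the equivalent of `page_fb.content`.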

In [23]:
# Response code from request

page_fb

<Response [200]>


Looking at the `text` attribute of the response object `page_fb`, we can already see the title and meta description we want to extract.

In [25]:
print(page_fb.text[:1000])



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">

<head id="ctl00_Head1"><title>
	fullbeauty Official Site - Shop Plus Size Clothing
</title><meta name="description" content="fullbeauty delivers you the best selection on plus size clothing available online. fullbeauty has plus size clothing apparel for women, men and plus size living all in one location. " /><!-- MAXCHMB OSP_AffiliateTracking_Header --><!-- From Cache CHM3 ContentId: 206776 --><!-- Begin Monetate ExpressTag Sync v8. Place at start of document head. DO NOT ALTER. -->
<script type="text/javascript">var monetateT = new Date().getTime();</script>
<script type="text/javascript" src="//se.monetate.net/js/2/a-7736c7c2/p/fullbeauty.com/entry.js"></script>

<!-- End Monetate tag. -->
<!-- Begin Monetate initial

We'll convert this text to a BeautifulSoup object for easy parsing. This can be done with `page_fb.text`, but it's better to use the `content` attribute, which holds the HTML as raw bytes and lets BeautifulSoup determine the character encoding itself.

In [31]:
type(page_fb.text), type(page_fb.content)

(str, bytes)
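
To illustrate why bytes are a reasonable input, here is a small sketch using a made-up page (the markup and the names `raw_bytes`, `soup_demo` and `demo_title` are hypothetical). The page is encoded as ISO-8859-1 and declares that in a meta tag; when handed the bytes, BeautifulSoup works out the encoding on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical page encoded as ISO-8859-1 (Latin-1), declared in a meta tag
raw_bytes = (
    '<html><head><meta charset="iso-8859-1">'
    '<title>Café Menu</title></head><body></body></html>'
).encode('iso-8859-1')

# Given bytes, BeautifulSoup detects the encoding before parsing
soup_demo = BeautifulSoup(raw_bytes, 'html.parser')
demo_title = soup_demo.title.string

print(demo_title)                    # the accented character survives decoding
print(soup_demo.original_encoding)   # the encoding BeautifulSoup settled on
```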

In [None]:
soup_fb = BeautifulSoup(page_fb.content, 'html.parser')

In [9]:
soup = soup_fb

As promised, we now have the webpage as a BeautifulSoup object.

In [32]:
type(soup)

bs4.BeautifulSoup

Now let's see what we can do with it.

___
### Getting the Webpage Title

In [34]:
# Find all title tags on the page

soup.findAll('title')

[<title>
 	fullbeauty Official Site - Shop Plus Size Clothing
 </title>]

In [49]:
# Get the text

title = soup.findAll('title')[0].text.strip()
title

'fullbeauty Official Site - Shop Plus Size Clothing'
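
Indexing with `[0]` raises an `IndexError` if a page has no title tag. As a hedged alternative, the sketch below uses `find`, which returns the first match or `None`. The helper `get_title` and both HTML strings are hypothetical examples, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical pages: one with a title tag, one without
html_with_title = '<html><head><title>\n\tExample Shop\n</title></head></html>'
html_without_title = '<html><head></head><body><p>No title here</p></body></html>'


def get_title(soup):
    """Return the stripped page title, or None if the page has no title tag."""
    tag = soup.find('title')  # find() returns the first match or None
    if tag is None:
        return None
    return tag.get_text(strip=True)


title_present = get_title(BeautifulSoup(html_with_title, 'html.parser'))
title_missing = get_title(BeautifulSoup(html_without_title, 'html.parser'))
print(title_present)
print(title_missing)
```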

___
### Getting the Meta Description

In [35]:
# Find all meta tags on the page

soup.findAll('meta')

[<meta content="fullbeauty delivers you the best selection on plus size clothing available online. fullbeauty has plus size clothing apparel for women, men and plus size living all in one location. " name="description"/>,
 <meta content="3a3454312b35dde2763c7a848091ac57" name="p:domain_verify"/>,
 <meta content="Gt5b0rw1xkxDChb5nnuib4hC6cg-7va0m-VjOhlqlOk" name="google-site-verification"/>,
 <meta content="https://www.facebook.com/pages/fullbeauty/248083818548669" itemprop="sameAs"/>,
 <meta content="https://twitter.com/fullbeautystyle" itemprop="sameAs"/>,
 <meta content="https://www.pinterest.com/fullbeautystyle/" itemprop="sameAs"/>,
 <meta content="https://plus.google.com/+Fullbeautystyle/" itemprop="sameAs"/>,
 <meta content=" https://www.youtube.com/channel/UCBQwBS7I72sXIphAQKlGp9w " itemprop="sameAs"/>,
 <meta content="index,follow,noydir,noodp" name="robots"/>,
 <meta content="com, es" name="language"/>,
 <meta content="fullbeauty.com" name="author"/>,
 <meta content="text/html

In [36]:
# Find the description meta tag

soup.findAll('meta', {'name': 'description'})

[<meta content="fullbeauty delivers you the best selection on plus size clothing available online. fullbeauty has plus size clothing apparel for women, men and plus size living all in one location. " name="description"/>]

In [59]:
# Get the text

meta_description = soup.findAll('meta', {'name': 'description'})[0]
meta_description

<meta content="fullbeauty delivers you the best selection on plus size clothing available online. fullbeauty has plus size clothing apparel for women, men and plus size living all in one location. " name="description"/>

In [99]:
# What happens when you call meta_description.text ?



In [101]:
# It looks like we'll have to find another way to parse it. Below the
# object has been converted to a string. Try using the split function
# to manually parse the text we are interested in and assign it to 
# a variable.

meta_description = str(meta_description)
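
As an alternative to splitting the string by hand, a tag's attributes can be read directly. The sketch below assumes a made-up snippet with the same structure as the meta tag above; the names `html_demo`, `meta_tag`, `inner_text` and `description` are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the structure of the meta tag above
html_demo = ('<head><meta name="description" '
             'content="Example description for a demo page. "/></head>')
soup_demo = BeautifulSoup(html_demo, 'html.parser')

meta_tag = soup_demo.find('meta', {'name': 'description'})

# A meta tag has no inner text, so .text is an empty string
inner_text = meta_tag.text

# The description is stored in the tag's content attribute
description = meta_tag.get('content', '').strip()

print(repr(inner_text))
print(description)
```

Using `.get('content', '')` rather than `meta_tag['content']` avoids a `KeyError` when the attribute is absent.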

___
### Getting the Headings

Let's open a new webpage to use for the remainder of the tutorial. It contains a "top 10" list of knife products that we will put into a list. First, though, let's get the headings.

In [None]:
url = 'http://www1.macys.com/cms/ce/splash/kitchen/cutlery-buying-guide'
page_macys = requests.get(url)
soup_macys = BeautifulSoup(page_macys.content, 'html.parser')

In [102]:
soup = soup_macys

In [105]:
soup.findAll('h1')

[<h1>keep your kitchen on point</h1>]

In [116]:
soup.findAll('h2')

[<h2>find the knives you need to get the job done right.</h2>]

In [124]:
# We'll store all the headings in a dictionary

headings = {}
for h in ['h1', 'h2', 'h3', 'h4', 'h5']:
    headings[h] = soup.findAll(h)
    
headings

{'h1': [<h1>keep your kitchen on point</h1>],
 'h2': [<h2>find the knives you need to get the job done right.</h2>],
 'h3': [<h3>8” CHEF'S KNIFE</h3>,
  <h3>3” or 4” PARING KNIFE</h3>,
  <h3>8” BREAD KNIFE</h3>,
  <h3>7” SANTOKU KNIFE</h3>,
  <h3>8” CARVING KNIFE</h3>,
  <h3>6” UTILITY KNIFE</h3>,
  <h3>5” SERRATED UTILITY KNIFE</h3>,
  <h3>7” FILLET KNIFE</h3>,
  <h3>7” CLEAVER</h3>,
  <h3>5½” BONING KNIFE</h3>],
 'h4': [<h4>cutting-edge talent</h4>],
 'h5': [<h5>these names have serious chops—stock your space now.</h5>]}

In [126]:
# Iterate through the dictionary parsing the text from
# the soup objects in each list

temp_dict = {}
for key, value in headings.items():
    
    temp_value = []
    for v in value:
        temp_value.append(v.text)
    
    temp_dict[key] = temp_value

headings = temp_dict
headings

{'h1': ['keep your kitchen on point'],
 'h2': ['find the knives you need to get the job done right.'],
 'h3': ["8” CHEF'S KNIFE",
  '3” or 4” PARING KNIFE',
  '8” BREAD KNIFE',
  '7” SANTOKU KNIFE',
  '8” CARVING KNIFE',
  '6” UTILITY KNIFE',
  '5” SERRATED UTILITY KNIFE',
  '7” FILLET KNIFE',
  '7” CLEAVER',
  '5½” BONING KNIFE'],
 'h4': ['cutting-edge talent'],
 'h5': ['these names have serious chops—stock your space now.']}
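
The two steps above (collect the tags, then extract their text) can also be combined. The following sketch runs on a made-up fragment and makes two assumptions beyond the original: it includes `h6`, and it uses `find_all`, the current spelling of the older `findAll` alias, together with `get_text(strip=True)` to trim whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with a few headings
html_demo = """
<h1>keep your kitchen on point</h1>
<h3> 8" CHEF'S KNIFE </h3>
<h3>7" CLEAVER</h3>
"""
soup_demo = BeautifulSoup(html_demo, 'html.parser')

# Build {tag name: [heading text, ...]} for h1 through h6 in one pass
heading_tags = ['h%d' % level for level in range(1, 7)]
headings_demo = {
    tag: [h.get_text(strip=True) for h in soup_demo.find_all(tag)]
    for tag in heading_tags
}

print(headings_demo)
```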

___
### Getting Other Body Text

The top 10 knife names have already been found, since they were wrapped in `h3` tags. However, each knife also has a description that we would like to get.

In [128]:
# How many div tags do we find?

len(soup.findAll('div'))

76

Searching for the div tag, we find a lot of elements. By finding the relevant section in the HTML source, we notice that the div for the top 10 list has the id "cutleryguide-container".

In [133]:
# Use findAll to search for the div tag containing the specified id
# and store it in a variable named top10
# hint: see the meta description example above



In [140]:
top10 = soup.findAll('div', {'id': 'cutleryguide-container'})[0]

In [None]:
top10

Each element returned by the `findAll` method is a tag object that supports the same search methods as the soup itself, so we can call `findAll` again on `top10`. The descriptions are contained in `<p>` paragraph tags, which makes them easy to get.
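
As a self-contained sketch of this nested search, consider the made-up fragment below. Only the id `cutleryguide-container` comes from the page discussed here; the rest of the markup and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the id matches the page discussed above
html_demo = """
<div id="header"><p>Site navigation</p></div>
<div id="cutleryguide-container">
  <h3>CHEF'S KNIFE</h3><p>A versatile blade.</p>
  <h3>BREAD KNIFE</h3><p>A serrated blade.</p>
</div>
"""
soup_demo = BeautifulSoup(html_demo, 'html.parser')

# find() returns the first matching tag (or None), so no [0] is needed
container = soup_demo.find('div', id='cutleryguide-container')

# Searching from the container ignores paragraphs elsewhere on the page
all_paragraphs = soup_demo.find_all('p')
container_paragraphs = container.find_all('p')

print(len(all_paragraphs), len(container_paragraphs))
```

Searching from the container rather than the whole soup keeps unrelated paragraphs, such as navigation text, out of the result.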

In [146]:
top10_descriptions = top10.findAll('p')
top10_descriptions

[<p>The most versatile addition to your kitchen features a gently curved blade that's ideal for chopping, mincing &amp; most cooking tasks.</p>,
 <p>Equipped with a straight edge that feels like an extension of the hand, the paring knife delivers precision when cutting small foods.</p>,
 <p>Made to glide right through crusty breads, this long blade has a serrated edge that neatly slices without denting or crushing.</p>,
 <p>The Japanese knife perfects thin, delicate slicing with a lighter, harder blade. A hollow edge decreases friction for precise results.</p>,
 <p>Ranging from rigid to flexible, this thinner blade is cut out for carving poultry, hams, roasts and other large cuts of meat.</p>,
 <p>The utility knife is your on-call companion, expertly mincing, peeling &amp; carving vegetables &amp; fruits.</p>,
 <p>Great for food that is soft on the outside &amp; firm on the inside, this mid-size blade was made to prep everything from tomatoes to sausages.</p>,
 <p>Effortlessly fillet f

As we did earlier for the dictionary, we could loop over the list, extracting the text from each element. Instead, let's use a so-called _list comprehension_.

In [147]:
top10_descriptions = [element.text for element in top10_descriptions]
top10_descriptions

["The most versatile addition to your kitchen features a gently curved blade that's ideal for chopping, mincing & most cooking tasks.",
 'Equipped with a straight edge that feels like an extension of the hand, the paring knife delivers precision when cutting small foods.',
 'Made to glide right through crusty breads, this long blade has a serrated edge that neatly slices without denting or crushing.',
 'The Japanese knife perfects thin, delicate slicing with a lighter, harder blade. A hollow edge decreases friction for precise results.',
 'Ranging from rigid to flexible, this thinner blade is cut out for carving poultry, hams, roasts and other large cuts of meat.',
 'The utility knife is your on-call companion, expertly mincing, peeling & carving vegetables & fruits.',
 'Great for food that is soft on the outside & firm on the inside, this mid-size blade was made to prep everything from tomatoes to sausages.',
 'Effortlessly fillet fish & gently slice other delicate foods with this sle
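
For comparison, the sketch below shows an explicit loop next to the equivalent list comprehension. It uses a hypothetical `Element` stand-in with a `text` attribute in place of parsed tags, so it runs without downloading anything.

```python
from collections import namedtuple

# Stand-in for parsed tags: any object with a .text attribute will do here
Element = namedtuple('Element', ['text'])
elements = [Element('first description'), Element('second description')]

# Explicit loop, in the style used for the headings dictionary
texts_loop = []
for element in elements:
    texts_loop.append(element.text)

# Equivalent list comprehension
texts_comprehension = [element.text for element in elements]

print(texts_loop == texts_comprehension)
```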

Now we just need to link these up with the knife names and assign the correct ranks based on order.

In [153]:
top10_knives = [(rank, knife, description) for rank, knife, description
                in zip(range(1,11), headings['h3'], top10_descriptions)]
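
One thing to keep in mind is that `zip` stops at its shortest input, so a missing description would silently drop a knife from the list. The sketch below is one possible variation, using shortened, made-up stand-ins for `headings['h3']` and `top10_descriptions`: it checks the lengths first and lets `enumerate` supply the ranks instead of a hard-coded `range(1,11)`.

```python
# Hypothetical, shortened stand-ins for headings['h3'] and top10_descriptions
names = ["CHEF'S KNIFE", 'PARING KNIFE', 'BREAD KNIFE']
descriptions = ['Versatile curved blade.', 'Small straight edge.', 'Long serrated edge.']

# zip() stops at the shortest input, so check the lengths match first
if len(names) != len(descriptions):
    raise ValueError('Expected one description per knife name')

# enumerate(..., start=1) numbers the pairs without hard-coding the count
ranked = [(rank, name, description)
          for rank, (name, description)
          in enumerate(zip(names, descriptions), start=1)]

print(ranked[0])
```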

In [154]:
top10_knives

[(1,
  "8” CHEF'S KNIFE",
  "The most versatile addition to your kitchen features a gently curved blade that's ideal for chopping, mincing & most cooking tasks."),
 (2,
  '3” or 4” PARING KNIFE',
  'Equipped with a straight edge that feels like an extension of the hand, the paring knife delivers precision when cutting small foods.'),
 (3,
  '8” BREAD KNIFE',
  'Made to glide right through crusty breads, this long blade has a serrated edge that neatly slices without denting or crushing.'),
 (4,
  '7” SANTOKU KNIFE',
  'The Japanese knife perfects thin, delicate slicing with a lighter, harder blade. A hollow edge decreases friction for precise results.'),
 (5,
  '8” CARVING KNIFE',
  'Ranging from rigid to flexible, this thinner blade is cut out for carving poultry, hams, roasts and other large cuts of meat.'),
 (6,
  '6” UTILITY KNIFE',
  'The utility knife is your on-call companion, expertly mincing, peeling & carving vegetables & fruits.'),
 (7,
  '5” SERRATED UTILITY KNIFE',
  'Great

In [155]:
# Use indexing to access the 7" CLEAVER description from the top10_knives list.

