In [None]:
Chapter 6. Web Scraping with Python Requests and BeautifulSoup
We have learned how to communicate with the Web through Requests, and everything went smoothly while we worked with APIs. However, there are situations where we cannot rely on an API.

The first concern is that not all web services provide an API for their third-party customers. Also, there is no guarantee that an API will be maintained perfectly. Even tech giants such as Google, Facebook, and Twitter tend to change their APIs abruptly without prior notice. So, it's better to understand that an API will not always come to the rescue when we are looking for some vital information from a web resource.

Web scraping comes to the rescue when we need to access information from a web resource that does not maintain an API. In this chapter, we will discuss the tricks of the trade for extracting information from web resources while following the principles of web scraping.

Before we begin, let's get to know some important concepts that will help us to reach our goal. Take a look at the response content format of a request, which will introduce us to a particular type of data:

>>> import requests
>>> r = requests.get("http://en.wikipedia.org/wiki/List_of_algorithms")
>>> r
<Response [200]>
>>> r.text
u'<!DOCTYPE html>\n<html lang="en" dir="ltr" class="client-nojs">\n<head>\n<meta charset="UTF-8" />\n<title>

In [None]:
In the preceding example, the response content is rendered in the form of semistructured data, which is 
represented using HTML tags; this in turn helps us to access the information about the different sections of a web
page individually.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Now, let's get to know the different types of data that the Web generally deals with.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Types of data
In most cases, we deal with three types of data when working with web sources. They are as follows:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Structured data
Unstructured data
Semistructured data
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


In [None]:
Structured data
Structured data is a type of data that exists in an organized form. Normally, structured data has a predefined 
format and is machine readable. Because a specific format is imposed on it, each piece of data has a well-defined 
relation to the others, which makes it easier and faster to access different parts of the data. 
A structured format also helps reduce redundancy when dealing with huge amounts of data.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Relational databases contain structured data, and SQL can be used to access data from them. We can regard census records as an example of structured data. They contain information about the date of birth, gender, place, income, and so on, of the people of a country.

Unstructured data
In contrast to structured data, unstructured data either lacks a standard format or stays unorganized even when a specific format is imposed on it. For this reason, dealing with the different parts of the data is difficult and tedious. To handle unstructured data, techniques such as text analytics, Natural Language Processing (NLP), and data mining are used. Images, scientific data, and text-heavy content (such as newspapers, health records, and so on) come under the unstructured data type.

Semistructured data
Semistructured data is a type of data that has an irregular structure or a structure that changes rapidly. It can be self-describing: it uses tags and other markers to establish a semantic relationship among the elements of the data. Semistructured data may contain information that comes from different sources. Scraping is the technique used to extract information from this type of data. The information available on the Web is a perfect example of semistructured data.
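To make the contrast concrete, here is a minimal sketch (Python 3, standard library only) that reads a structured record by field name and then pulls values out of a semistructured HTML fragment by walking its tags. The census rows and the ScoreCollector class are made up for illustration; the HTML fragment mirrors the survey markup used later in this chapter.

```python
import csv
import io
from html.parser import HTMLParser

# Structured data: a fixed schema, so every field is addressed by name.
# (These census rows are invented sample values.)
census_csv = "name,gender,income\nAsha,F,52000\nRavi,M,48000\n"
rows = list(csv.DictReader(io.StringIO(census_csv)))
incomes = [int(row["income"]) for row in rows]


# Semistructured data: tags describe the data, so we walk the markup
# and react to the tags we care about.
class ScoreCollector(HTMLParser):
    """Hypothetical helper that collects the text of <span class="score">."""

    def __init__(self):
        super().__init__()
        self.in_score = False
        self.scores = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "score") in attrs:
            self.in_score = True

    def handle_data(self, data):
        if self.in_score:
            self.scores.append(int(data))

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_score = False


collector = ScoreCollector()
collector.feed(
    '<li>Yes - <span class="score">21</span></li>'
    '<li>No - <span class="score">19</span></li>'
)

print(incomes)
print(collector.scores)
```

Writing a parser class by hand like this quickly becomes tedious, which is the gap that a scraping library such as BeautifulSoup fills.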

In [None]:
What is web scraping?
In simple words, web scraping is the process of extracting desired data from a web resource. This method involves 
different procedures such as interacting with the web resource, choosing the appropriate data, obtaining 
information from the data, and converting the data to the desired format. Of all these procedures, this chapter 
focuses mainly on pulling the required data out of the semistructured 
data.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dos and don'ts of web scraping
Scraping a web resource is not always welcomed by its owners. Some companies restrict the use of bots against their sites. It is good etiquette to follow certain rules while scraping. The following are the dos and don'ts of web scraping:



In [None]:
Do refer to the terms and conditions: The first thing that should come to our mind before we begin scraping is terms and conditions. Do visit the website's terms and conditions page and get to know whether they prohibit scraping from their site. If so, it's better to back off.
Don't bombard the server with a lot of requests: Every website runs on a server that can handle only a limited workload. Bombarding the server with lots of requests in a short span of time is rude and may result in a server breakdown. Wait for some time between requests instead of sending too many at once.
NOTE
Some sites put a restriction on the maximum number of requests processed per minute and will ban the request sender's IP address if this is not adhered to.

Do track the web resource from time to time: A website doesn't always stay the same. Websites tend to change from time to time according to their usability and the requirements of their users. If the website has been altered, our scraping code may fail. Do remember to track the changes made to the site, modify the scraper script, and scrape accordingly.
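The advice about spacing out requests can be sketched as a small helper. This is only an illustration: fetch_politely, its fetch argument, and the one-second default delay are assumptions made for this example, not part of Requests. In practice, fetch would typically be requests.get.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    'fetch' is any callable that takes a URL (for example, requests.get).
    'sleep' is injectable so that the pause can be replaced in tests.
    """
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        responses.append(fetch(url))
    return responses
```

A suitable delay depends on the site; some sites publish their own limits, and those should take precedence over any default chosen here.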
Predominant steps to perform web scraping
Generally, the process of web scraping requires the use of different tools and libraries such as the following:

Chrome DevTools or Firebug add-on: These can be used to pinpoint the pieces of information in an HTML/XML page.
HTTP libraries: These can be used to interact with the server and to pull a response document. An example of this is python-requests.
Web scraping tools: These are used to pull data from a semistructured document. Examples include BeautifulSoup and Scrapy.
The overall picture of web scraping can be observed in the following steps:

Identify the URL(s) of the web resource to perform the web scraping task.
Use your favorite HTTP client/library to pull the semistructured document.
Before extracting the desired data, discover the pieces of data that are in semistructured format.
Utilize a web scraping tool to parse the acquired semistructured document into a more structured one.
Draw the desired data that we are hoping to use. That's all, we are done!

In [None]:
Key web scraping tasks
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
While pulling the required data from a semistructured document, we perform various tasks. The following are the 
basic tasks that we adopt for scraping:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Searching a semistructured document: Accessing a particular element or a specific type of element in a document 
can be accomplished using its tag name and tag attributes, such as id, class, and so on.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Navigating within a semistructured document: We can navigate through a web document to pull different types of 
data in four ways, which are navigating down, navigating sideways, navigating up, and navigating back and forth.
We can get to know more about these in detail later in this chapter.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Modifying a semistructured document: By modifying the tag name or the tag attributes of a document, we can 
streamline and pull the required data.

In [None]:
What is BeautifulSoup?
The BeautifulSoup library is a simple yet powerful web scraping library. It can extract the desired data when provided with an HTML or XML document, and it offers a rich set of methods that help us perform web scraping tasks with little effort.

Document parsers
Document parsers aid us in parsing and serializing semistructured documents written in HTML, XML, or other markup languages. By default, BeautifulSoup uses Python's standard HTMLParser. If we want to use a different parser, such as lxml or html5lib, we need to install it explicitly.

In this chapter, we will focus only on the parts of the library that help us understand the techniques needed to develop the practical scraping bot that we will build at the end of this chapter.

Installation
Installing BeautifulSoup is pretty straightforward. We can use pip to install it with ease:

$ pip install beautifulsoup4
Whenever we intend to scrape a web resource using BeautifulSoup, we need to create a BeautifulSoup object for it. The following are the commands to do this:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(<HTML_DOCUMENT_STRING>)
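Recent versions of BeautifulSoup warn when no parser is named, because the parser picked by default depends on which parsers are installed. A minimal sketch, assuming Python's built-in html.parser is good enough for the document at hand (the html_document string is just a sample):

```python
from bs4 import BeautifulSoup

html_document = "<h1 id='message'>Hello, Requests!</h1>"

# Naming the parser makes the result independent of which optional
# parsers (lxml, html5lib) happen to be installed on the machine.
soup = BeautifulSoup(html_document, "html.parser")

heading = soup.h1
print(heading.name, heading["id"], heading.string)
```

Passing "lxml" or "html5lib" instead selects those parsers, provided that the corresponding package is installed.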

In [None]:
Objects in BeautifulSoup
The BeautifulSoup object parses the given HTML/XML document and converts it into a tree of Python objects, which are discussed in the following sections.

TAGS
The word "tag" represents an HTML/XML tag in the provided document. Each tag object has a name and a lot of attributes and methods. The following example showcases the way to deal with a tag object:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h1 id='message'>Hello, Requests!</h1>")
In order to access the type, name, and attributes of a tag in the soup object that we created in the preceding example, use the following commands:

For accessing the tag type:
>>> tag = soup.h1
>>> type(tag)
<class 'bs4.element.Tag'>
For accessing the tag name:
>>> tag.name
'h1'
For accessing the tag attribute ('id' in the given HTML string):
>>> tag['id']
'message'
BEAUTIFULSOUP
The object that gets created when we intend to scrape a web resource is called a BeautifulSoup object. Put simply, it is the complete document that we are planning to scrape. This can be done using the following commands:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h1 id='message'>Hello, Requests!</h1>")
>>> type(soup)
<class 'bs4.BeautifulSoup'>

In [None]:
NAVIGABLESTRING
A NavigableString object represents the text contents of a tag. We use the .string attribute of the tag object to access it:

>>> tag.string
u'Hello, Requests!'
COMMENTS
The Comment object represents a comment in the web document. The following lines of code show a Comment object:

>>> soup = BeautifulSoup("<p><!-- This is comment --></p>")
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
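Because Comment is a special kind of NavigableString, a scraper that only wants visible text usually has to tell the two apart. The following is a small sketch of one way to do that with isinstance; the sample markup is made up, and the string keyword of find_all assumes a reasonably recent BeautifulSoup release (older releases call it text).

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>Visible text<!-- hidden note --></p>", "html.parser")

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Keep only the direct children of <p> that are not comments.
visible = [node for node in soup.p.contents if not isinstance(node, Comment)]

print(comments)
print(visible)
```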
Web scraping tasks related to BeautifulSoup
As mentioned in the earlier Key web scraping tasks section, BeautifulSoup supports all of those basic tasks in the process of web scraping. We can get to know these tasks in detail with the help of a practical example, using an HTML document. We will use the following HTML document, scraping_example.html, as an example throughout the chapter:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>
      Chapter 6 - Web Scrapping with Python Requests and BeatuifulSoup
    </title>
  </head>
  <body>
    <div class="surveys">
      <div class="survey" id="1">
        <p class="question">
          <a href="/surveys/1">Are you from India?</a>
        </p>
        <ul class="responses">
          <li class="response">Yes - <span class="score">21</span>
          </li>
          <li class="response">No - <span class="score">19</span>
          </li>
        </ul>
      </div>
      <div class="survey" id="2">
        <p class="question">
          <a href="/surveys/2">Have you ever seen the rain?</a>
        </p>
        <ul class="responses">
          <li class="response">Yes - <span class="score">40</span>
          </li>
          <li class="response">No - <span class="score">0</span>
          </li>
        </ul>
      </div>
      <div class="survey" id="3">
        <p class="question">
          <a href="/surveys/3">Do you like grapes?</a>
        </p>
        <ul class="responses">
          <li class="response">Yes - <span class="score">34</span>
          </li>
          <li class="response">No - <span class="score">6</span>
          </li>
        </ul>
      </div>
    </div>
  </body>
</html>
To make the structure of the preceding web document clear, we can view it as a document tree. The following diagram represents the preceding HTML document:

[Figure: document tree of the preceding HTML document]
When we create the BeautifulSoup object for the previously shown web document, it will result in a tree of Python objects.

In [None]:
To perform different tasks with the previous document, scraping_example.html, we need to create a BeautifulSoup object. To create it, open the Python shell and run the following commands:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("scraping_example.html"))
From now on, we will use the preceding BeautifulSoup object to execute different tasks. Let's perform the web scraping tasks on the scraping_example.html document and get an overall idea of all the tasks.

Searching the tree
To identify the different tags in an HTML/XML document, we need to search the whole document. In such situations, we can use BeautifulSoup methods such as find, find_all, and so on.

Here is the syntax to search the whole document to identify the tags:

find(name, attrs, recursive, text, **kwargs)
name: This is the name of the tag to look for; find returns the first matching tag that appears in the document. It can be a string, a regular expression, a list, a function, or the value True.
find_all(name, attrs, recursive, text, limit, **kwargs)
name: This is used to access specific types of tags by their name. It can be a string, a regular expression, a list, a function, or the value True.
limit: This is the maximum number of results in the output.
The common parameters for the preceding two methods are as follows:

attrs: This is a dictionary of attributes that a matching HTML/XML tag must have.
recursive: This takes a Boolean value. If it is set to True (the default), the BeautifulSoup library checks all the descendants of a specific tag. If it is set to False, the BeautifulSoup library checks only the direct children of the tag.
text: This parameter matches the string content of the document instead of tag names. In recent versions of BeautifulSoup, this parameter is named string.
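The parameters above can be tried out on a fragment of the survey document. This is a minimal sketch in Python 3 syntax; the fragment is a shortened copy of the first survey, and the parser is named explicitly as "html.parser" to keep the result predictable.

```python
from bs4 import BeautifulSoup

html = """
<div class="survey" id="1">
  <p class="question"><a href="/surveys/1">Are you from India?</a></p>
  <ul class="responses">
    <li class="response">Yes - <span class="score">21</span></li>
    <li class="response">No - <span class="score">19</span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns only the first match.
first_score = soup.find("span", class_="score")

# find_all returns every match; attrs filters on tag attributes.
all_scores = [int(span.string)
              for span in soup.find_all("span", attrs={"class": "score"})]

# limit caps the number of results.
limited = soup.find_all("li", limit=1)

# recursive=False looks at direct children only; the <li> tags are
# grandchildren of the <div>, so nothing matches here.
direct_children = soup.div.find_all("li", recursive=False)

print(first_score.string, all_scores, len(limited), len(direct_children))
```

The class_ keyword is used because class is a reserved word in Python; passing attrs={"class": "score"} is an equivalent way to express the same filter.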
NAVIGATING WITHIN THE TREE
Navigating the document tree with the BeautifulSoup4 module involves different tasks; they are discussed in the following sections.

Navigating down
We can access a particular element's data by moving down in a document. If we consider the document tree in the previous figure, we can access different elements by moving downward from the top element—html.

Every element can be accessed using its tag name. Here is a way to access the contents of the html element:

>>> soup.html
<html lang="en">
...
...
</html>
Here are the ways in which we can access the elements of the preceding document tree by navigating down. In order to access the title element, we should go from top to bottom, that is, from html to head and from head to title, as shown in the following command:

>>> soup.html.head.title
<title>Chapter 6 - Web Scrapping with Python Requests and BeatuifulSoup</title>
Similarly, you can access the meta element, as shown in the following command:

>>> soup.html.head.meta
<meta charset="utf-8"/>
Navigating sideways
To access the siblings in a document tree, we should navigate sideways. The BeautifulSoup library provides various tag object properties such as .next_sibling, .previous_sibling, .next_siblings, and .previous_siblings.

If you look at the preceding diagram containing the document tree, the different siblings at different levels of the tree, when navigated sideways, are as follows:

head and body
div1, div2, and div3
In the document tree, the head tag is the first child of html, and body is the next child of html. In order to access the children of the html tag, we can use its children property:

>>> for child in soup.html.children:
...     print child.name
...
head
body
To access the next sibling of head element we can use .find_next_sibling:

>>> soup.head.find_next_sibling()
<body>
    <div class="surveys">
        .
        .
        .
    </div>
</body>
To access the previous sibling of body, we can use .find_previous_sibling():

>>> soup.body.find_previous_sibling()
<head><meta charset="utf-8"/><title>... </title></head>
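One detail worth knowing when navigating sideways is that whitespace between tags is itself a node in the tree. The following sketch, which uses a tiny made-up document and the html.parser parser, shows how the .next_sibling property can land on such a whitespace string, whereas find_next_sibling() skips ahead to the next tag.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head>\n<body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Strings have no tag name, so getattr falls back to None for them.
child_names = [getattr(child, "name", None) for child in soup.html.children]

raw_sibling = soup.head.next_sibling         # the newline between the tags
tag_sibling = soup.head.find_next_sibling()  # the next tag, <body>

print(child_names)
print(repr(raw_sibling), tag_sibling.name)
```

This is why looping over .children on a real page may yield entries with no tag name in between the tags.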
Navigating up
We can access a particular element's parent by moving toward the top of the document tree. The BeautifulSoup library provides two properties—.parent and .parents—to access the first parent of the tag element and all its ancestors, respectively.

Here is an example:

>>> soup.div.parent.name
'body'

>>> for parent in soup.div.parents:
...     print parent.name
...
body
html
[document]
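Navigating up is handy when we have found a small piece of data and need the element that encloses it. As a sketch, using a shortened copy of the first survey and the html.parser parser, we can start from a score and climb to its survey with find_parent:

```python
from bs4 import BeautifulSoup

html = """
<div class="survey" id="1">
  <ul class="responses">
    <li class="response">Yes - <span class="score">21</span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

score = soup.find("span", class_="score")

# .parents walks all the way up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in score.parents]

# find_parent stops at the first ancestor that matches the filter.
survey = score.find_parent("div", class_="survey")

print(ancestor_names)
print(survey["id"])
```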
Navigating back and forth
To access the previously parsed element, we navigate back in the tree, and to access the element that gets parsed immediately next, we navigate forward in the tree. To deal with this, the tag object provides the .previous_element and .next_element properties, along with the .find_previous() and .find_next() methods, as shown in the following example:

>>> soup.head.find_previous().name
'html'
>>> soup.head.find_next().name
'meta'
Modifying the Tree
The BeautifulSoup library also lets us make changes to the web document according to our requirements. We can alter a tag using its .name and .string attributes and its .append() method. We can also add new tags and strings to an existing tag with the help of the .new_string() and .new_tag() methods. There are also other methods, such as .insert(), .insert_before(), .insert_after(), and so on, to make various modifications to the document tree.

Here is an example of changing the title tag's .string attribute:

Before modifying the title tag, the title contents are:
>>> soup.title.string
u'Chapter 6 - Web Scrapping with Python Requests and BeatuifulSoup'
This is the way to modify the contents of a title tag:
>>> soup.title.string = 'Web Scrapping with Python Requests and BeatuifulSoup by Balu and Rakhi'
After the modification, the contents of the title tag look like this:
>>> soup.title.string
u'Web Scrapping with Python Requests and BeatuifulSoup by Balu and Rakhi'
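The other modification tools mentioned above can be combined in the same way. The following sketch, on a made-up one-item list parsed with html.parser, changes a string, builds a new tag with new_tag, appends it, and finally renames the enclosing tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul class='responses'><li class='response'>Yes</li></ul>",
    "html.parser",
)

# Replace the text of the existing item.
soup.ul.li.string = "Yes - 21"

# Create a new tag, give it an attribute and a string, and append it.
new_item = soup.new_tag("li")
new_item["class"] = "response"
new_item.string = "No - 19"
soup.ul.append(new_item)

# Rename the enclosing tag from <ul> to <ol>.
soup.ul.name = "ol"

items = [li.string for li in soup.ol.find_all("li")]
print(items)
```

Note that these changes affect only the in-memory tree; the original web page is not touched.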

In [None]:
Building a web scraping bot – a practical example
At this point, we have learned enough techniques to scrape the Web. With all the information acquired, let's look at a practical example. We will now create a web scraping bot, which will pull a list of words from a web resource and store them in a JSON file.

Let's turn on the scraping mode!

The web scraping bot
Here, the web scraping bot is an automated script that has the capability to extract words from a website named majortests.com. This website consists of various tests and Graduate Record Examinations (GRE) word lists. With this web scraping bot, we will scrape the previously mentioned website and create a list of GRE words and their meanings in a JSON file.

The following image is the sample page of the website that we are going to scrape:

[Figure: sample page of the website that we are going to scrape]
Before we kick-start the scraping process, let's revise the dos and don'ts of web scraping mentioned in the initial part of the chapter. Following them will keep us out of trouble:

Do refer to the terms and conditions: Yes, before scraping majortests.com, refer to the terms and conditions of the site and obtain the necessary legal permissions to scrape it.
Don't bombard the server with a lot of requests: Keeping this in mind, we add a delay after every request that we send to the website, using Python's time.sleep function.
Do track the web resource from time to time: We ensured that the code runs perfectly with the website as it is currently served. Do check the site once before starting to scrape, so that changes to it won't break the code. This can be done by running some unit tests that check whether the pages still have the structure we expect.

In [None]:
Now, let's start the implementation by following the steps to scrape that we discussed previously.

IDENTIFYING THE URL OR URLS
The first step in web scraping is to identify the URL or a list of URLs that will result in the required resources. In this case, our intent is to find all the URLs that result in the expected list of GRE words. The following is the list of the URLs of the pages that we are going to scrape:

http://www.majortests.com/gre/wordlist_01,

http://www.majortests.com/gre/wordlist_02,

http://www.majortests.com/gre/wordlist_03, and so on

Our aim is to scrape words from nine such URLs, for which we found a common pattern. This will help us to crawl all of them. The common pattern for all those URLs is written as a Python format string, as follows:

http://www.majortests.com/gre/wordlist_0%d

In our implementation, we defined a function called generate_urls, which will generate the required list of URLs using the preceding URL string. The following snippet demonstrates the process in a Python shell:

>>> START_PAGE, END_PAGE = 1, 10
>>> URL = "http://www.majortests.com/gre/wordlist_0%d"
>>> def generate_urls(url, start_page, end_page):
...     urls = []
...     for page in range(start_page, end_page):
...         urls.append(url % page)
...     return urls
...
>>> generate_urls(URL, START_PAGE, END_PAGE)
['http://www.majortests.com/gre/wordlist_01', 'http://www.majortests.com/gre/wordlist_02', 'http://www.majortests.com/gre/wordlist_03', 'http://www.majortests.com/gre/wordlist_04', 'http://www.majortests.com/gre/wordlist_05', 'http://www.majortests.com/gre/wordlist_06', 'http://www.majortests.com/gre/wordlist_07', 'http://www.majortests.com/gre/wordlist_08', 'http://www.majortests.com/gre/wordlist_09']
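Note that the wordlist_0%d pattern only produces well-formed names for single-digit page numbers. As a hedged variation, the %02d format pads the number to two digits and keeps working past page 9. Whether the site actually serves pages beyond wordlist_09 is not verified here, and the list comprehension is just a more compact way of writing the same loop.

```python
# Zero-padded variant of the URL pattern (an assumption for illustration).
URL = "http://www.majortests.com/gre/wordlist_%02d"


def generate_urls(url, start_page, end_page):
    """Return the URLs for pages start_page .. end_page - 1."""
    return [url % page for page in range(start_page, end_page)]


first_three = generate_urls(URL, 1, 4)
tenth = URL % 10

print(first_three)
print(tenth)
```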
USING AN HTTP CLIENT
We will use the requests module as an HTTP client to get the web resources:

>>> import requests
>>> def get_resource(url):
...     return requests.get(url)
...
>>> get_resource("http://www.majortests.com/gre/wordlist_01")
<Response [200]>
In the preceding code, the get_resource function takes url as an argument and uses the requests module to get the 
resource.
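A scraper that runs unattended usually benefits from a timeout and from failing loudly on error responses. The following is a sketch of such a variant, not the version used in the rest of this chapter: the session parameter and the ten-second default are assumptions, and session is expected to be a requests.Session (or anything with a compatible get method). The timeout argument and raise_for_status() are standard Requests features.

```python
def get_resource(url, session, timeout=10):
    """Fetch 'url' through 'session' and fail fast on HTTP errors.

    'session' is assumed to be a requests.Session instance.
    """
    # Without a timeout, a stalled server could block the bot indefinitely.
    response = session.get(url, timeout=timeout)
    # Raises an exception for 4xx and 5xx responses instead of letting
    # an error page flow into the parsing step.
    response.raise_for_status()
    return response


# Typical usage (requires the requests package and network access):
#   import requests
#   page = get_resource("http://www.majortests.com/gre/wordlist_01",
#                       requests.Session())
```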

DISCOVERING THE PIECES OF DATA TO SCRAPE
Now, it is time to analyze and classify the contents of the web page. The content in this context is a list of words with their definitions. In order to identify the elements that hold the words and their definitions, we used Chrome DevTools. Inspecting the HTML elements tells us which tags hold each word and its definition, and we can use this in the process of scraping.

To carry this out, open the URL (http://www.majortests.com/gre/wordlist_01) in the Chrome browser and access the Inspect element option by right-clicking on the web page:

[Figure: the word list page inspected with Chrome DevTools]
From the preceding image, we can identify the structure of the word list, which appears in the following manner:

<div class="grid_9 alpha">
  <h3>Group 1</h3>
  <a name="1"></a>
  <table class="wordlist">
    <tbody>
      <tr>
        <th>Abhor</th>
        <td>hate</td>
      </tr>
      <tr>
        <th>Bigot</th>
        <td>narrow-minded, prejudiced person</td>
      </tr>
      ...
      ...
    </tbody>
  </table>
</div>
By looking at the structure of this web page, we can infer the following:

Each web page consists of a word list
Every word list has many word groups that are defined in the same div tag
All the words in a word group are described in a table that has the class attribute wordlist
Each and every table row (tr) in the table represents a word and its definition using the th and td tags, respectively
UTILIZING A WEB SCRAPING TOOL
Let's use BeautifulSoup4 as a web scraping tool to parse the obtained web page contents that we received using the requests module in one of the previous steps. By following the preceding interpretations, we can direct BeautifulSoup to access the required content of the web page and deliver it as an object:

def make_soup(html_string):
    return BeautifulSoup(html_string)
In the preceding lines of code, the make_soup function takes the HTML content in the form of a string and returns a BeautifulSoup object.

DRAWING THE DESIRED DATA
The BeautifulSoup object that we obtained in the previous step is used to extract the required words and their definitions from it. Now, with the methods available in the BeautifulSoup object, we can navigate through the obtained HTML response, and then we can extract the list of words and their definitions:

def get_words_from_soup(soup):
    words = {}

    for count, wordlist_table in enumerate(
            soup.find_all(class_='wordlist')):

        title = "Group %d" % (count + 1)

        new_words = {}
        for word_entry in wordlist_table.find_all('tr'):
            new_words[word_entry.th.text] = word_entry.td.text

        words[title] = new_words

    return words
In the preceding lines of code, get_words_from_soup takes a BeautifulSoup object, looks for all the tables with the wordlist class using the instance's find_all() method, and returns a dictionary that maps each word group to its words and their definitions.
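The extraction logic can be exercised offline on a small sample before it is pointed at the live site. The sketch below (Python 3 syntax, html.parser named explicitly) is a slightly more defensive variant rather than the chapter's exact function: it strips surrounding whitespace with get_text(strip=True) and skips rows that lack a th or td cell. The sample markup follows the structure shown earlier, and its third row is a made-up malformed row.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="grid_9 alpha">
  <h3>Group 1</h3>
  <table class="wordlist">
    <tbody>
      <tr><th>Abhor</th><td>hate</td></tr>
      <tr><th>Bigot</th><td>narrow-minded, prejudiced person</td></tr>
      <tr><td>row without a word cell</td></tr>
    </tbody>
  </table>
</div>
"""


def get_words_from_soup(soup):
    """Return {group title: {word: definition}} for every wordlist table."""
    words = {}
    tables = soup.find_all("table", class_="wordlist")
    for count, table in enumerate(tables, start=1):
        group = {}
        for row in table.find_all("tr"):
            # Skip rows that do not have both a word and a definition.
            if row.th is None or row.td is None:
                continue
            word = row.th.get_text(strip=True)
            group[word] = row.td.get_text(strip=True)
        words["Group %d" % count] = group
    return words


words = get_words_from_soup(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(words)
```

A check like this can also serve as the kind of unit test suggested earlier for detecting changes in the structure of the site.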

The dictionary of words obtained previously will be saved in a JSON file using the following helper method:

def save_as_json(data, output_file):
    """ Writes the given data into the specified output file"""
    with open(output_file, 'w') as outfile:
        json.dump(data, outfile)
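To confirm that the saved file can be read back, a quick round trip helps. In this sketch, load_json is a hypothetical companion helper, the indent and sort_keys arguments are optional extras that only make the file easier to read, and the file is written to a temporary directory rather than to words.json.

```python
import json
import os
import tempfile


def save_as_json(data, output_file):
    """Write 'data' to 'output_file' as human-readable JSON."""
    with open(output_file, "w") as outfile:
        json.dump(data, outfile, indent=2, sort_keys=True)


def load_json(input_file):
    """Read the JSON document stored in 'input_file'."""
    with open(input_file) as infile:
        return json.load(infile)


words = {"Group 1": {"Abhor": "hate"}}
path = os.path.join(tempfile.mkdtemp(), "words.json")

save_as_json(words, path)
restored = load_json(path)
print(restored == words)
```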
On the whole, the process can be depicted in the following program:

import json
import time

import requests

from bs4 import BeautifulSoup

START_PAGE, END_PAGE, OUTPUT_FILE = 1, 10, 'words.json'

# Identify the URL
URL = "http://www.majortests.com/gre/wordlist_0%d"


def generate_urls(url, start_page, end_page):
    """
    This method takes a 'url' and returns a generated list of url strings

        params: a 'url', 'start_page' number and 'end_page' number
        return value: a list of generated url strings
    """
    urls = []
    for page in range(start_page, end_page):
        urls.append(url % page)
    return urls



def get_resource(url):
    """
    This method takes a 'url' and returns a 'requests.Response' object

        params: a 'url'
        return value: a 'requests.Response' object
    """
    return requests.get(url)


def make_soup(html_string):
    """
    This method takes a 'html string' and returns a 'BeautifulSoup' object

        params: html page contents as a string
        return value: a 'BeautifulSoup' object
    """
    return BeautifulSoup(html_string)


def get_words_from_soup(soup):

    """
    This method extracts word groups from a given 'BeautifulSoup' object

        params: a BeautifulSoup object to extract data
        return value: a dictionary of extracted word groups
    """

    words = {}
    count = 0

    for wordlist_table in soup.find_all(class_='wordlist'):

        count += 1
        title = "Group %d" % count

        new_words = {}
        for word_entry in wordlist_table.find_all('tr'):
            new_words[word_entry.th.text] = word_entry.td.text

        words[title] = new_words
        print(" - - Extracted words from %s" % title)

    return words


def save_as_json(data, output_file):
    """ Writes the given data into the specified output file"""
    with open(output_file, 'w') as outfile:
        json.dump(data, outfile)


def scrapper_bot(urls):
    """
    Scrapper bot:
        params: takes a list of urls

        return value: a dictionary of word lists containing
                      different word groups
    """

    gre_words = {}
    for url in urls:

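        print("Scraping %s" % url.split('/')[-1])

        # step 1: identify the URL (each 'url' comes from generate_urls)

        # step 2: pull the page using the HTTP client
        html = get_resource(url)

        # step 3: the desired pieces of data in the page were identified
        # beforehand using browser tools

        # step 4: parse the page
        soup = make_soup(html.text)

        # step 5: extract the words
        words = get_words_from_soup(soup)

        gre_words[url.split('/')[-1]] = words

        # be polite to the server between requests
        print("sleeping for 5 seconds now")
        time.sleep(5)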

    return gre_words

if __name__ == '__main__':

    urls = generate_urls(URL, START_PAGE, END_PAGE)

    gre_words = scrapper_bot(urls)

    save_as_json(gre_words, OUTPUT_FILE)
Here is the content of the words.json file:

{"wordlist_04":
    {"Group 10":
        {"Devoured": "greedily eaten/consumed",
         "Magnate": "powerful businessman",
         "Cavalcade": "procession of vehicles",
         "Extradite": "deport from one country back to the home...
    .
    .
    .
}

In [None]:
Summary
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
In this chapter, you learned about the different types of data that we encounter with web sources. We came to 
know about the need for web scraping, the legal issues, and the benefits that it offers. Then, we
looked closely at web scraping tasks and their potential. You learned about a new library called BeautifulSoup, and
its ins and outs, with examples.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
We explored the capabilities of BeautifulSoup in depth and worked on some examples to get a clear idea of it.
Finally, we created a practical scraping bot by applying the knowledge that we gained from the previous sections,
which gave us hands-on experience of scraping a real website.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
In the next chapter, you will learn about the Flask microframework and we will build an application using it by following the best practices.