In [16]:
# Import Splinter and BeautifulSoup
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager


# Initialize Chrome Browser in Splinter

In [17]:
# Set the executable path and initialize chrome browser in splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Current google-chrome version is 86.0.4240
[WDM] - Get LATEST driver version for 86.0.4240
[WDM] - Driver [C:\Users\Stephen\.wdm\drivers\chromedriver\win32\86.0.4240.22\chromedriver.exe] found in cache


 


# Scrape Data from Website

Before we start with the code, we'll want to use DevTools to look at the HTML behind the page. Right-click the webpage and select "Inspect." From the DevTools window, we can select an element directly on the page instead of searching through the tags.

First, select the inspect icon (the one to the far left).

![image.png](attachment:image.png)

Then, click the element you want to select on the page, such as the humor tag. This will direct your DevTools to the line of code the humor tag is nested in.

![image.png](attachment:image.png)

In [9]:
# Visit the Quotes to Scrape site
# This code tells Splinter which site to visit: we assign the link to the url variable, then pass it to browser.visit().
url = 'http://quotes.toscrape.com/'
browser.visit(url)

In [10]:
# Use BeautifulSoup to Parse the HTML
# BeautifulSoup parses the HTML and then stores it as an object.
# html.parser is used to parse the HTML here, but other parsers are available.
html = browser.html
html_soup = soup(html, 'html.parser')

# Scrape the Title

1. We take the html_soup object we created earlier and chain find() to it to search for the <h2 /> tag. find() returns the first match it encounters.
2. We then extract only the text within the HTML tags by adding .text to the end of the code.
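One caveat is worth sketching here: find() returns None when nothing matches, so chaining .text directly onto a failed search raises an AttributeError. The example below is a minimal illustration, assuming a small hypothetical HTML string (sample_html) in place of the live page source.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for browser.html; this is not the real page source.
sample_html = "<div><h2>Top Ten tags</h2><p>No h3 here</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
heading = sample_soup.find("h2")
missing = sample_soup.find("h3")

# Guard against None before reading .text.
heading_text = heading.text if heading is not None else ""
missing_text = missing.text if missing is not None else ""

print(repr(heading_text))
print(repr(missing_text))
```

The same guard could be applied to the title scrape if the page layout is likely to change.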

In [11]:
# Scrape the Title
title = html_soup.find('h2').text
title

'Top Ten tags'

# Scrape All of the Tags

Using our DevTools again, look at the code for the tags. We want all of the tags instead of just one, so we want to first use our select tool to highlight the <div /> container that holds all of the tags.

![image.png](attachment:image.png)

Notice that the <div /> container holding all of the tags has two classes. The col-md-4 class is a Bootstrap feature. Bootstrap is an HTML and CSS framework that simplifies adding functional components that look nice by default. In this case, col-md-4 means that this webpage is using a grid layout, and it's a common class that many webpages use. We'll dive into that more later.

The other class, tags-box, looks custom, though. Let's make sure first by searching for it using our search box.

![image.png](attachment:image.png)


After searching for tags-box, we can see that only one result is returned. This means that it's unique in the HTML and can be used to locate specific data. Next, expand the tags-box div to take a look at the contents.

From here, we can see a list of <span /> elements, each with a class of tag-item. Open some of the <span /> elements to see what they contain; if you see <a /> elements with the names in the list that we're targeting, then we're in the right place.

Since there are 10 items in the list displayed in the browser, let's use the DevTools search function to verify the list item count. Search for tag-item and note the number of returned results. If there are 10, then we're ready to go.

![image.png](attachment:image.png)
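The same checks can also be run in code rather than in the DevTools search box. This is a rough sketch, assuming a shortened, hypothetical snippet (sample_html) modeled on the structure described above; it contains three items instead of the page's ten.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modeled on the tags-box structure.
sample_html = """
<div class="col-md-4 tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/humor/">humor</a></span>
  <span class="tag-item"><a class="tag" href="/tag/books/">books</a></span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any one of an element's classes, so 'tags-box' matches
# even though the div also carries 'col-md-4'.
box_count = len(sample_soup.find_all("div", class_="tags-box"))
item_count = len(sample_soup.find_all("span", class_="tag-item"))

print(box_count, item_count)
```

Against the real page, we would expect the first count to be 1 and the second to be 10, matching what DevTools reports.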

In [None]:
# Scrape the top ten tags
tag_box = html_soup.find('div', class_='tags-box')
# tag_box
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)


This code looks similar to the previous cell, but we've increased the difficulty a bit by incorporating a for loop. Let's start at the beginning.

The first line, tag_box = html_soup.find('div', class_='tags-box'), creates a new variable tag_box, which will be used to store the results of a search. In this case, we're looking for <div /> elements with a class of tags-box, and we're searching for it in the HTML we parsed earlier and stored in the html_soup variable.

The second line, tags = tag_box.find_all('a', class_='tag'), is similar to the first but with a few tweaks to make the search more specific. The new tags variable holds the results of a find_all, but this time we're searching through the parsed results stored in our tag_box variable to find <a /> elements with a class of tag.

We used find_all this time because we want to capture all results, instead of a single or specific one.

Next, we've added a for loop. This for loop cycles through each tag in the tags variable, strips the HTML code out of it, and then prints only the text of each tag.
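If the tag names are needed later rather than just printed, the loop can collect them into a list instead. The sketch below assumes a hypothetical two-item snippet (sample_html) and uses get_text(strip=True) so that stray whitespace around a tag name is dropped; tag_words is an illustrative name, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item version of the tags box.
sample_html = """
<div class="tags-box">
  <span class="tag-item"><a class="tag" href="/tag/love/"> love </a></span>
  <span class="tag-item"><a class="tag" href="/tag/humor/">humor</a></span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
tag_box = sample_soup.find("div", class_="tags-box")

# Collect the text of every matching <a> element into a list.
tag_words = [a.get_text(strip=True) for a in tag_box.find_all("a", class_="tag")]

print(tag_words)
```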

# Scrape Across Pages

In [13]:
# Assign the URL to a variable and then tell Splinter to visit the webpage.
url = 'http://quotes.toscrape.com/'
browser.visit(url)

In [14]:
# Create a for loop to collect each quote, "click" the next button, then collect the next set of quotes.
# We'll use range(1, 6) in our for loop to visit the first five pages of the website.
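for x in range(1, 6):
    # Parse the HTML of the page currently loaded in the browser
    html = browser.html
    quote_soup = soup(html, 'html.parser')
    quotes = quote_soup.find_all('span', class_='text')
    for quote in quotes:
        print('page:', x, '----------')
        print(quote.text)
    # Find the "Next" link and click it to load the following page
    browser.links.find_by_partial_text('Next').click()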

page: 1 ----------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
page: 1 ----------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
page: 1 ----------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
page: 1 ----------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
page: 1 ----------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
page: 1 ----------
“Try not to become a man of success. Rather become a man of value.”
page: 1 ----------
“It is better to be hated for what you are than to be loved for what you are not.”
page: 1 ----------
“I have not failed. I've just found 10,000 ways that won't work.”
page: 1 ----------
“A woman is like a tea bag; you never know how strong it is u
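The loop uses a fixed range(1, 6), which assumes the site has at least five pages. One possible refinement is to check the parsed HTML for a Next link before moving on. The sketch below is an illustration only: the pager markup (an li element with class next wrapping the link) is an assumption about the page, and next_page_href is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def next_page_href(html):
    """Return the href of the Next link, or None if there is no such link.

    Assumes the pager looks like <li class="next"><a href="...">Next</a></li>.
    """
    page_soup = BeautifulSoup(html, "html.parser")
    next_item = page_soup.find("li", class_="next")
    if next_item is None:
        return None
    link = next_item.find("a")
    return link.get("href") if link is not None else None


# Hypothetical fragments standing in for browser.html on two different pages.
middle_page = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(next_page_href(middle_page))
print(next_page_href(last_page))
```

Inside the scraping loop, a None result could be used as the signal to break out instead of relying on a hard-coded page count.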

It's important to note that there are many ways that BeautifulSoup can search for text, but the syntax is typically the same: we look for a tag first, then an attribute.

We can search for items using only a tag, such as a <span /> or <h1 />, but a class or id attribute makes the search that much more specific.
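To make the tag-then-attribute pattern concrete, here is a small comparison of the different search styles. The markup is a hypothetical snippet (sample_html), not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to compare search styles.
sample_html = """
<div id="main">
  <span class="text">First quote</span>
  <span class="author">Some Author</span>
  <span class="text">Second quote</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag only: matches every <span>, whatever its class.
all_spans = sample_soup.find_all("span")

# Tag plus class: narrows the search to the quote text.
text_spans = sample_soup.find_all("span", class_="text")

# Tag plus id: ids are meant to be unique, so find() is enough.
main_div = sample_soup.find("div", id="main")

# The attrs dictionary is an equivalent way to filter on attributes.
author = sample_soup.find("span", attrs={"class": "author"})

print(len(all_spans), len(text_spans), author.text)
```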

In [15]:
browser.quit()